Method and apparatus for storing and delivering documents on the Internet

ABSTRACT

An improved method and apparatus is used for storing and delivering information over the Internet and using Internet technologies. According to one embodiment of the present invention, a method and apparatus is disclosed for validating a collection of data.

RELATED INVENTIONS

[0001] The present application is a division of application Ser. No. 09/379,376 filed on Aug. 23, 1999, which is a continuation of application Ser. No. 08/897,786, filed on Jul. 21, 1997, now U.S. Pat. No. 6,038,601. The above-mentioned applications are hereby incorporated herein by reference.

FIELD OF INVENTION

[0002] The present invention relates to the field of Internet and wide-area networking technology. Specifically, the present invention relates to the storage and delivery of information over the Internet and using Internet technologies.

DESCRIPTION OF RELATED ART

[0003] The World Wide Web (the Web) represents all of the computers on the Internet that offer users access to information on the Internet via interactive documents or Web pages. Web information resides on Web servers on the Internet or within company networks. Web client machines running Web browsers or other Internet software can access these Web pages via a communication protocol known as Hypertext Transfer Protocol (HTTP). With the proliferation of information on the Web and information accessible in company networks, it has become increasingly difficult for users to locate and effectively use this information. As such, the mode of storing, delivering, and interacting with data in the Internet, and the Web in particular, has changed over time.

[0004] FIG. 1A illustrates a typical Internet configuration comprising client 100 and content provider 102 coupled via the Internet. The content provider may include a media company, a consumer service, a business supplier, or a corporate information source inside the company's network.

[0005] The use of information within a wide-area network such as the Internet poses problems not usually experienced in smaller, local-area networks. The latency of the Internet produces delays that become the performance bottleneck in retrieving information. Clients may be connected to the network only part of the time, but still want access to information from their local platform that was retrieved from the content provider prior to being disconnected. The granularity and independence of the objects in a wide-area network, particularly the Internet, make the task of aggregating them more difficult.

[0006] The use of client and intermediate caching of the content provider information may alleviate some of the problems of the wide-area network interactions. Certain implementations today perform this caching on behalf of the client, but sacrifice data timeliness and do not address performance problems because they must validate their caches in single operations over the network.

[0007] FIG. 1A also illustrates the typical Internet configuration of client-to-content-provider interaction. Subsequent to connecting to the Internet, client 100 will generally request objects from the content provider 102. The client must locate the information, often through manual or automatic searches, then retrieve the data through the client.

[0008] When searching for the data initially, this “pull” model provides great utility in locating information. Implicit in the model, however, is that the client machine has the responsibility for finding and downloading data as desired. The user is faced with the problem of having to scour the Web for various information sites that may be of interest to him or her. Although this model provides a user with a large degree of flexibility in terms of the type of information that he or she would like to access each time he or she connects to the Internet, there is clearly a downside to the model in that the user is forced to constantly search for information on the Internet. Given the exponential rate of growth of data on the Internet, this type of searching is becoming increasingly cumbersome.

[0009] While the pull model is effective for finding information, once a user has found an information source (a location from which subsequent information of interest to the user will be distributed) he or she must continue to check for new information periodically. In the “pull” model, the server is inherently passive and the client does all the work of initiating requests. If the server has new information of interest to the client, the server has no method of delivering either the information or a notification to the client that the information exists. The content provider cannot, in the pull model, provide an “information service” where active server information is identified, then passed to the client in terms of some kind of notification.

[0010] It is therefore an object of the present invention to provide a method to manage passive and active data throughout the network, and offer an improved method and apparatus for storing and delivering information on the Internet.

SUMMARY OF THE INVENTION

[0011] The present invention discloses an improved method and apparatus for storing and delivering information over the Internet and using Internet technologies. According to one embodiment of the present invention, a method and apparatus is disclosed for validating a collection of data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] The present invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like reference numerals refer to similar elements and in which:

[0013] FIG. 1A illustrates a typical Internet configuration.

[0014] FIG. 1B illustrates a typical computer system in which the present invention operates.

[0015] FIG. 2 illustrates the three major components according to one embodiment of the present invention.

[0016] FIG. 3 illustrates an overview of how the three components of one embodiment of the present invention interact with each other.

[0017] FIGS. 4-6 are flow charts illustrating embodiments of the present invention.

[0018] FIG. 7A illustrates a regular expression used as a “positive filter” for links on a page.

[0019] FIG. 7B illustrates a regular expression used as a “negative filter” for links on a page.

[0020] FIG. 8 is a flow chart illustrating one embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0021] The present invention relates to a method and apparatus for storing and delivering documents on the Internet. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent to one of ordinary skill in the art, however, that these specific details need not be used to practice the present invention. In other instances, well known structures, interfaces and processes have not been shown in detail in order not to unnecessarily obscure the present invention.

[0022] FIG. 1B illustrates a typical computer system 100 in which the present invention operates. One embodiment of the present invention is implemented on a personal computer architecture. It will be apparent to those of ordinary skill in the art that other alternative computer system architectures may also be employed.

[0023] In general, such computer systems as illustrated by FIG. 1B comprise a bus 101 for communicating information, a processor 102 coupled with the bus 101 for processing information, main memory 103 coupled with the bus 101 for storing information and instructions for the processor 102, a read-only memory 104 coupled with the bus 101 for storing static information and instructions for the processor 102, a display device 105 coupled with the bus 101 for displaying information for a computer user, an input device 106 coupled with the bus 101 for communicating information and command selections to the processor 102, and a mass storage device 107 coupled with the bus 101 for storing information and instructions. A data storage medium 108, such as a magnetic disk and associated disk drive, containing digital information is configured to operate with mass storage device 107 to allow processor 102 access to the digital information on data storage medium 108 via bus 101.

[0024] Processor 102 may be any of a wide variety of general purpose processors or microprocessors such as the PENTIUM® brand processor manufactured by INTEL® Corporation. It will be apparent to those of ordinary skill in the art, however, that other varieties of processors may also be used in a particular computer system. Display device 105 may be a liquid crystal device, cathode ray tube (CRT), or other suitable display device. Mass storage device 107 may be a conventional hard disk drive, floppy disk drive, CD-ROM drive, or other magnetic or optical data storage device for reading and writing information stored on a hard disk, a floppy disk, a CD-ROM, a magnetic tape, or other magnetic or optical data storage medium. Data storage medium 108 may be a hard disk, a floppy disk, a CD-ROM, a magnetic tape, or other magnetic or optical data storage medium.

[0025] In general, processor 102 retrieves processing instructions and data from a data storage medium 108 using mass storage device 107 and downloads this information into random access memory 103 for execution. Processor 102 then executes an instruction stream from random access memory 103 or read-only memory 104. Command selections and information input at input device 106 are used to direct the flow of instructions executed by processor 102. Input device 106 may equivalently be a pointing device such as a conventional mouse or trackball device. The results of this processing execution are then displayed on display device 105.

[0026] Computer system 100 includes a network device 110 for connecting computer system 100 to a network. The network device 110 for connecting computer system 100 to the network includes Ethernet devices, data modems and ISDN adapters. It will be apparent to one of ordinary skill in the art that other network devices may also be utilized.

[0027] The preferred embodiment of the present invention is implemented as a software module, which may be executed on a computer system such as computer system 100 in a conventional manner. Using well known techniques, the application software of the preferred embodiment is stored on data storage medium 108 and subsequently loaded into and executed within computer system 100. Once initiated, the software of the preferred embodiment operates in the manner described below.

[0028] 1. Introduction

[0029] The presently claimed invention improves use of the Web and wide-area networks by managing groups of network objects (content or applications) and bringing that content and notifications from servers directly to desktops in a timely fashion and while consuming a minimal amount of desktop screen space. Users subscribe to “channels”, which automatically bring new content to the user's machine and render information in summary form. Detailed content attached to the subscription is rendered in the user's web browser, and automatically pre-fetched by one embodiment of the present invention.

[0030] Channel configuration and rendition, as well as related content pre-fetch, are all controlled by default from the content provider via a back-end server, thus giving content providers a large amount of control over their own data and how it is presented. Content providers can supply their own graphics, advertising, ticker information, animation control, and content refresh parameters. Subscribing to channels is also simple: content providers simply place an icon on one or more pages in their site and clicking the icon causes a subscription to be created on the client side and/or the server side. The subscription is then updated automatically.

[0031] According to one embodiment, the presently claimed invention plays a unique role as an intermediary and mediator between end users, namely consumers of information, and content providers, namely producers of information. One embodiment of the present invention provides functionality that gives to the content provider control over their brand, the way in which their information is presented, and the way in which users access their web site. At the same time, users are given the ability to pick and choose the information they want from the sites they want, and presentation of that information is accelerated, thus improving the user's web experience.

[0032] An intelligent caching infrastructure that uses information (called “Meta-Data”) about the content to control client and intermediate caches reduces the wide-area networking problems generally attributed to interactive content by allowing the caches to manage expiration, compaction, bulk-delivery and other operations guided via Meta-Data from the content provider. The caches become intelligent delivery nodes at the client, and within the network, because they are able to understand the important properties of the information they are managing.

[0033] The presently claimed invention also provides a notification system to the content provider and user that spans Internet, firewall, and internal network systems, and can combine various underlying transports and notifications into a general notification architecture. This offers content providers (servers) the opportunity to provide a pro-active information service to the user. The user may receive various types of Internet and internal company notifications and information.

[0034] When the notification and “active information” push model is combined with an intelligent caching network infrastructure, in the presently claimed invention, the user achieves the highest degree of functionality, performance, and usability. As described in further detail below, the presently claimed invention spans client and server machines, creating a system that allows feedback from the client to the back-end server, and subsequent optimization of the client by the back-end server.

[0035] 2. Components

[0036] The presently claimed invention consists of three major components: the content bar, caching server and back-end server, as illustrated in FIG. 2. The content bar and the caching server reside on one or more end-user computers (“client machines”) owned by information subscribers. The back-end server resides on one or more server-class computers owned by information publishers (“back-end server machines”). The content bar and caching server are logical components. Each component can be implemented as separate processes or within a single process.

[0037] The client machines and the back-end server machines communicate over a network such as the Internet or a corporate intranet. The communication mechanism includes open standard protocols such as HTTP (Hypertext Transfer Protocol), MIME (Multipurpose Internet Mail Extensions) and TCP/IP (Transmission Control Protocol/Internet Protocol).

[0038] One embodiment of the present invention is implemented in the caching server. An alternate embodiment is implemented in a combination of the caching server and the back-end server. The content bar in either embodiment is a rendering environment for published content. Although the following sections describe a two-tier client-server architecture, the presently claimed invention may also be implemented according to other architectures. For example, while the content bar always resides on a subscriber's client machine and the back-end server always resides on a publisher's back-end server machine, there can be any number of intermediate caching servers between the subscriber and the publisher.

[0039] The caching servers do not have to reside on the client machine. Placing a caching server on the client machine typically provides the best performance, however, by taking advantage of local disk access speed. Caching servers can be deployed around the network to balance network load and provide concentration of frequently used information. In this embodiment, each caching server is implemented as a standard HTTP proxy server, thus allowing them to be coupled together hierarchically. FIG. 3 illustrates an overview of how the three components of one embodiment of the present invention interact with each other.

[0040] 2.1 Content Bar

[0041] The end user interacts with the content bar. According to one embodiment, the content bar is responsible for rendering channel subscription content, configuring each subscription's look and feel, and managing the user's interaction with the web. The user can create several content bars, and can float them on the desktop, dock them to one of the edges of the display, or dock them to a web browser. When docked to one of the screen edges, the content bar can be configured to auto-hide, so that it appears only when the mouse is placed on the edge of the screen.

[0042] Each content bar contains one or more channels, namely areas of the content bar that belong to a particular content provider. Channel look and feel is under the control of the content provider via open data formats (MIME, HTTP, standard image formats such as GIF and JPEG). The content bar provides a common rendering environment for the channels so they can be moved, resized, or locally configured by the user in a standard way.

[0043] Each channel contains one or more “subscriptions.” Subscriptions are agents configured to retrieve information at various times, or to process asynchronous notifications of incoming data. Initial configuration is set up by the content provider according to the content provider's publishing schedules. The user can change this information as well as add to it. Each channel subscription uses a notification mechanism to retrieve new data. The notification mechanism can be simple polling, or more complex asynchronous event notification mechanisms, as described later in the document.

[0044] 2.2 Caching Server

[0045] The caching server manages all of the user's interaction with the web. All web requests, including those generated by the user's browser and those generated by channel subscriptions, go through the caching server. The caching server is responsible for the following areas of functionality, either alone or in concert with one or more publisher back-end servers. These areas of functionality are described in further detail later in this document:

[0046] Intelligent cache management, including local algorithms for automatic expiration management and content compaction, and algorithms shared with the back-end server for custom expiration management.

[0047] Statistics collection and upload to back-end servers.

[0048] Lookahead pre-fetch of content based on local algorithms and on custom control information from back-end servers.

[0049] Bulk validation of content based on information (meta-data) from back-end servers.

[0050] Intelligent bandwidth management, allowing user requests priority over background lookahead pre-fetch requests.

[0051] Registration and subscription by the user to information sources.

[0052] Handling of incoming channel subscription notifications, removing the need for the caching server to poll its content providers for new information.

[0053] A caching server can operate on its own on behalf of a community of many users, who in turn have caching servers on their own machines. These higher-level caching servers can be placed at intranet/Internet boundaries to provide information concentration and conserve network bandwidth.

[0054] 2.3 Back End

[0055] The back-end server is a collection of software that works with client caching servers to optimize use of a publisher's site by its subscribers. According to one embodiment, each publisher has a back-end server that controls use of its content by clients and feeds information extracted from the clients back to the publisher. The back-end server is responsible for the following areas of functionality:

[0056] Maintenance of cache control meta-data. According to one embodiment, this data is provided to caching servers which use the data to control the way in which the content provider's information is cached.

[0057] Generation of subscription data and subsequent publishing of that data for retrieval by caching servers, or subsequent sending of that data directly to the caching servers.

[0058] Creation and maintenance of bulk-validation information. Bulk-validation data is initially created by the content provider and sent from the back-end server to caching servers, allowing them to validate the publisher's cached content efficiently.

[0059] Creation and maintenance of lookahead information. As above, the information is initially created by the content provider. As it is sent to caching servers and used by those servers, updated information is uploaded from the servers and used to fine-tune the lookahead information. The result is a feedback loop that tunes lookahead based on client use of the publisher's content.

[0060] Generation of content and subscription usage reports. The back-end server uses statistics uploaded from caching servers to give publishers an accurate picture of how their site is used, including for example extremely accurate advertisement display and click-through counts.

[0061] 3. Shared Technology

[0062] This section details the technology that is shared by the back-end server and the caching server. According to one embodiment, the interaction between these two components provides configuration flexibility and efficient performance. By locating acceleration information at the publisher, the creation of that data is placed in the hands of the people most likely to know how to manage it, namely the publisher. By then downloading that information to caching servers, the system allows a site to be accelerated according to the wishes of the site owner, who is in the best position to know how to do this.

[0063] According to one embodiment, the system further introduces a feedback loop so that the publisher is in a position to retrieve accurate information about their site from their subscribers' caching servers. That information can be used in its own right to control advertising rates or provide reports of various sorts. The information can also be fed back into the information passing back to the caching servers, further enhancing their ability to accelerate the site.

[0064] 3.1 Bulk Validation

[0065] 3.1.1 Overview

[0066] Once a cached piece of content expires, the caching server must validate it. This task involves sending a request to the content's owner, and the content owner either responds that the content has not changed or provides the latest version of the content. In the latter case, the caching server still experiences the overhead of retrieving the content. In the former case, however, the server can increase performance significantly by using bulk validation. With bulk validation, a single request to the owning server results in large amounts of content being automatically validated or invalidated. Content that is invalidated is marked as expired in the cache, and the next time the caching server is asked for that content, the caching server knows to retrieve that content from the content owner. Content that is still valid has its expiration date extended and continues to be served from the cache until the content finally expires.

[0067] Bulk validation revolves around the idea that a group of content frequently has similar expiration characteristics. In a news page, for example, the masthead, footer, and side bars may be “boilerplate” that rarely changes, whereas the article and its associated images will change constantly. In a given site, all content that has similar expiration characteristics can be grouped into a single list, called a TOC (Table Of Contents). The TOC is an HTML page consisting of a header and a body that contains tags describing TOC members. The TOC is not intended to be viewed by end users; it is simply meta-data shared between the caching server and its back-end servers and used to configure the caching server's behavior. Meta-data is described in further detail in Section 3.2 Meta-Data.

[0068] Each TOC member is represented by a single ICPAGE HTML tag. The tag contains a number of attributes (TOCs are also used for lookahead configuration as described in Section 3.4.6, Custom Weight Assignment), including the LASTMOD attribute. This attribute contains the member's last-modification date in seconds since midnight Jan. 1, 1970 (standard epoch) and is stored with the member in the cache:

[0069] <ICPAGE

[0070] URL="http://truth.incommon.com/library/achannels.html"

[0071] LASTMOD=83540098>

[0072] The TOC is assigned an expiration date, which automatically applies to all members of the TOC. The expiration date can be assigned explicitly via a standard HTML META tag (the standard mechanism for driving HTTP data from content), or via any of the custom expiration mechanisms defined in Section 3.3, Custom Expiration Control. FIG. 4 is a flow chart illustrating an overview of one embodiment of bulk validation.
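As a rough illustration of how a caching server might read TOC members out of such a page, the following Python sketch collects the URL and LASTMOD attributes of each ICPAGE tag. The class and function names (TocParser, parse_toc) and the surrounding HTML wrapper are hypothetical and are not part of the described embodiment; only the ICPAGE tag and its URL and LASTMOD attributes come from the text above.

```python
from html.parser import HTMLParser


class TocParser(HTMLParser):
    """Collects a URL -> LASTMOD mapping from ICPAGE tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.members = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "icpage":
            return
        attr = dict(attrs)
        url = attr.get("url")
        lastmod = attr.get("lastmod")
        if url is None or lastmod is None:
            return  # skip members missing either attribute
        try:
            self.members[url] = int(lastmod)  # seconds since the epoch
        except ValueError:
            pass  # skip members whose LASTMOD is not an integer


def parse_toc(html_text):
    """Return a dict mapping each member URL to its last-modification time."""
    parser = TocParser()
    parser.feed(html_text)
    parser.close()
    return parser.members


# The HTML wrapper here is an assumption; the ICPAGE tag is the one shown above.
toc_html = """<HTML><HEAD><TITLE>toc</TITLE></HEAD><BODY>
<ICPAGE
URL="http://truth.incommon.com/library/achannels.html"
LASTMOD=83540098>
</BODY></HTML>"""
members = parse_toc(toc_html)
```

The result is a plain dictionary, which keeps the later comparison against cached last-modification dates a simple lookup.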

[0073] 3.1.2 Client Side Behavior

[0074] According to one embodiment, the caching server is designed such that whenever a TOC member is accessed, the TOC's expiration date always overrides the member's expiration date. Whenever the TOC expires, all its members automatically expire. Once a TOC expires, bulk validation begins. Any time one of its members is accessed, the member is noted as expired (it shares its TOC's expiration date), which causes the TOC to be retrieved first. The caching server receives one of two responses to its request for the TOC, just as it would for any other page. The first possibility is that the TOC has not changed. If so, then the TOC expiration date is updated according to its meta-data if present or the expiration algorithm if it is not. Each member automatically gets the new expiration date.

[0075] The second possibility is that the TOC has indeed changed, in which case the owning server sends the new TOC to the caching server. The caching server then parses the TOC's HTML stream looking for members. Each member is then looked up in the cache and its last-modification date compared with the incoming TOC copy's last-modification date for that member. If the dates are the same, the content has by definition not changed, and is assigned the TOC's new expiration date. If the dates are different, the content has changed and is automatically marked as having expired. Next time the content is accessed, it will be updated from the owning server.
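The two responses described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the cache is modeled as an in-memory dictionary, and the names CacheEntry, toc_unchanged and toc_changed are hypothetical. How the new expiration date is computed (meta-data or the expiration algorithm) is outside the sketch and is passed in as a parameter.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    last_modified: int      # seconds since the epoch, as in LASTMOD
    expires_at: int         # seconds since the epoch
    expired: bool = False


def toc_unchanged(cache, member_urls, new_expiration):
    """TOC has not changed: every cached member gets the new expiration date."""
    for url in member_urls:
        entry = cache.get(url)
        if entry is not None:
            entry.expires_at = new_expiration
            entry.expired = False


def toc_changed(cache, toc_members, new_expiration):
    """TOC has changed: compare each member's cached date with the new TOC.

    cache:       dict mapping URL -> CacheEntry (hypothetical cache model)
    toc_members: dict mapping URL -> LASTMOD taken from the incoming TOC
    Returns (validated, invalidated) lists of URLs.
    """
    validated, invalidated = [], []
    for url, lastmod in toc_members.items():
        entry = cache.get(url)
        if entry is None:
            continue  # member is not cached, so there is nothing to validate
        if entry.last_modified == lastmod:
            # Same date: content has not changed, extend its lifetime.
            entry.expires_at = new_expiration
            entry.expired = False
            validated.append(url)
        else:
            # Different date: mark expired so the next access refetches it.
            entry.expired = True
            invalidated.append(url)
    return validated, invalidated
```

Note that neither function touches the network for individual members; the only request is the one that fetched the TOC itself.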

[0076] This behavior results in a significant performance improvement. For example, for a TOC with 100 members (e.g. the boilerplate graphics for a web site), a single operation simultaneously validates all 100 members. Without a TOC, the caching server would have to perform 100 network operations to validate one member at a time, and most of the operations are likely to be useless because the content has probably not changed.

[0077] 3.1.3 Locating TOCs

[0078] Once an administrator has generated a TOC, the local caching server must be able to find the TOC so that the TOC can be loaded. A TOC is just another form of meta-data. TOC pages are therefore located exactly the same way as site meta-data is located, as described in Section 3.2 Meta-Data.

[0079] 3.1.4 TOC Management

[0080] According to one embodiment of the present invention, TOCs for a given site are maintained by the back-end server for that site. Each TOC has a well-known name that identifies the TOC uniquely on that site. The collection of TOCs on the site comprises a TOC catalog that can be browsed by administrators seeking to create new TOCs, delete TOCs, or modify existing TOCs.

[0081] The TOC catalog also contains access control and authentication information. The authentication data enables the back-end server to verify the identity of administrators wishing to manage the TOCs in the catalog. The back-end server uses the access control information to restrict particular administration functions to various groups of people.

[0082] Because the back-end server handles TOC management, the actual commands to manage a TOC can be issued either from a client machine such as a client PC or from a management interface on the back-end server machine. Several client PC-based mechanisms can be used to manage TOCs. The implementor can select a client-side mechanism based on portability, UI functionality, and client environment. Following are examples of such mechanisms:

[0083] Administrator channel on the channel bar

[0084] Custom version of client with added administrator user interface functionality

[0085] HTML forms accessible from any web browser

[0086] Java applet accessible from any web browser

[0087] Microsoft Windows OCX accessible from any web browser

[0088] The client-side administration mechanisms are responsible for presenting a graphical view of the TOC catalog and its member TOCs to the administrator. In addition, the client-side administration mechanism may provide some local editing capability as a means of increasing performance.

[0089] 3.1.5 Creating a TOC

[0090] According to one embodiment, TOCs are created by the back-end server in at least one of three ways, namely by scanning file system directories recursively, by scanning the content of HTML pages for links and following those links recursively, or being supplied the TOC information directly via database or content management system through an Application Programming Interface (API). Files (in the first method) or pages (in the second method) are included in the TOC if they satisfy various criteria, including but not limited to the following:

[0091] Number of levels to search. When scanning the file system, the number of levels of folders/directories to descend before stopping. When scanning HTML content for links, the number of levels of links to follow before stopping.

[0092] A regular expression which, if matched by the file or URL, includes that file or URL in the TOC.

[0093] A regular expression which, if not matched by the file or URL, includes that file or URL in the TOC.

[0094] The TOC can be constructed using any number of different inclusion and exclusion criteria. The above list is simply a sample of the possibilities. The TOC inclusion/exclusion criteria are stored in the TOC catalog, along with other administrative information such as:

[0095] How often it is updated by the back-end server software

[0096] Any additional update-related inclusion/exclusion criteria

[0097] How it is stored in a catalog of TOCs for maintenance by administrators

[0098] Any security attributes that control site administrators' abilities to read, modify, or delete the TOC
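The link-following method of Section 3.1.5, with a depth limit and the two regular-expression criteria, might be sketched in Python as shown below. This is an illustration only. The function name build_toc and its parameters are hypothetical, and the `links` dictionary stands in for fetching pages and extracting their links. The sketch also assumes that a page rejected by the filters is still followed for further links, which the text does not specify.

```python
import re
from collections import deque


def build_toc(start, links, max_depth, include=None, exclude=None):
    """Breadth-first link scan that returns the URLs selected for a TOC.

    start:     URL where the scan begins (level 0)
    links:     dict mapping a URL to the URLs it links to (stand-in for
               fetching and parsing real pages)
    max_depth: number of levels of links to follow before stopping
    include:   regex; a URL is included only if it matches (positive filter)
    exclude:   regex; a URL is included only if it does NOT match
               (negative filter)
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None

    seen = {start}
    queue = deque([(start, 0)])
    toc = []
    while queue:
        url, depth = queue.popleft()
        passes_include = include_re is None or bool(include_re.search(url))
        passes_exclude = exclude_re is None or not exclude_re.search(url)
        if passes_include and passes_exclude:
            toc.append(url)
        # Assumption: filtered-out pages are still scanned for links.
        if depth < max_depth:
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return toc


# Hypothetical site structure used only for illustration.
site = {
    "/index.html": ["/a.html", "/img/logo.gif"],
    "/a.html": ["/b.html"],
    "/b.html": ["/c.html"],
}
html_only = build_toc("/index.html", site, max_depth=2, include=r"\.html$")
no_images = build_toc("/index.html", site, max_depth=1, exclude=r"\.gif$")
```

The same loop would apply to a file system scan by replacing `links` with a directory listing and treating each folder level as one depth step.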

[0099] 3.1.6 Deleting a TOC

[0100] According to one embodiment, deleting a TOC consists of removing its catalog entry and associated generation criteria. The operation is performed by the back-end server, either in response to a client administration front-end or a back-end server utility. The back-end server uses its authentication and access control information to restrict TOC deletion to the appropriate administrators.

[0101] 3.1.7 Modifying a TOC

[0102] According to one embodiment, modifying a TOC consists of the following operations:

[0103] Adding one or more member pages

[0104] Removing one or more member pages

[0105] Modifying a page, changing its last-modification date, versionstamp, or lookahead weight

[0106] The modifications can be performed manually by an administrator,or automatically by the back-end server according to scheduling andinclusion/exclusion criteria specified by the administrator. Manualupdates are most useful in assigning custom lookahead weights to memberpages. Manual updates can be performed either on the back-end servermachine via server utility programs, or on a client PC via any of thegraphical administration mechanisms described in section 3.1.4 TOCManagement. Automatic updates are performed by the back-end serveraccording to scheduling information stored as part of the TOC definitionin the TOC catalog.

[0107] According to one embodiment, automatic updates begin with theupdate process saving all members of the current TOC. Any TOC membersthat were manually added by the administrator are specially marked. Theupdate process then retrieves the update inclusion/exclusion criteriafrom the TOC definition and begins generating a new TOC based on thosecriteria. For each TOC member included, the update process determineswhether the same member was present in the previous version of the TOC.If it was, the previous version's lookahead weight is carried over tothe new version, and the previous version is marked as being present inthe new version.

[0108] Once the TOC generation completes, the update process scans the previous version, copying any administrator-created pages that were not already part of the new TOC. Finally, the update process incorporates any additional lookahead weight statistics accumulated since the last TOC update, and then replaces the old version of the TOC with the new version.
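The automatic update steps above can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the `TocMember` record and the `update_toc` function are hypothetical names, and the final statistics step is left out because the text does not say how accumulated statistics are combined with existing weights.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TocMember:
    url: str
    weight: int
    manual: bool = False  # True if an administrator added the page by hand


def update_toc(previous, generated):
    """Merge a freshly generated member list with the previous TOC version."""
    old = {m.url: m for m in previous}
    carried = set()  # previous members that are present in the new version
    new_toc = []
    for member in generated:
        prior = old.get(member.url)
        if prior is not None:
            # Carry the previous lookahead weight over to the new version.
            member = replace(member, weight=prior.weight, manual=prior.manual)
            carried.add(member.url)
        new_toc.append(member)
    # Copy administrator-created pages that generation did not pick up.
    for prior in previous:
        if prior.manual and prior.url not in carried:
            new_toc.append(prior)
    return new_toc
```

In this sketch a page that was generated automatically in the previous version and no longer matches the criteria simply drops out, while a manually added page survives regeneration.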

[0109] 3.1.8 Generalized TOC Usage

[0110] A TOC in its general form is a unit of bulk information management. It describes a set of related Web objects by URL. The system then defines operations on the set such as the bulk validation process described above. Other useful bulk operations can also be defined for TOCs. A TOC can describe a set of objects which are to be retrieved in bulk by the caching server for later off-line viewing. A TOC can also describe a set of objects to be looked ahead on, independently of the lookahead algorithm described later in this document. Sets of web objects which share caching properties can also be grouped into a TOC.

[0111] 3.2. Meta-Data

[0112] 3.2.1 Overview

[0113] According to one embodiment of the present invention, “meta-data” is used to configure much of the behavior of caching servers. Content providers are allowed to configure their sites independently of other sites (or individual subscriptions independently of other subscriptions, all of which reference the same site), and that data is used to drive the behavior of the caching servers.

[0114] The term “site meta-data” is used to cover all meta-data that optimizes a particular site. The meta-data is stored in HTML tags and can therefore appear anywhere in a site. To make administration tractable, the tags are typically grouped into a single page, except for TOCs, which are pointed at from the site meta-data page.

[0115] According to one embodiment of the present invention, a site meta-data page is referenced with an ICMETA pointer tag:

[0116] <ICMETA

[0117] URL=http://www.incommon.com/sitedefs/nytimes_site.html>

[0118] There are a number of ways that the caching server can locate meta-data pages. According to one embodiment, the site administrator can add to every page on their site an ICMETA tag identifying the meta-data page. Whenever the caching server encounters an ICMETA tag in an HTML page, it fetches the URL pointed at by the tag. This scheme has the advantage that the meta-data page gets loaded immediately whenever a page from the site is loaded.
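As a rough illustration of how a caching server might spot ICMETA tags while scanning a page, the sketch below uses Python's standard `html.parser` module. The class and function names are hypothetical, and the fetch of the discovered URL is omitted; only the tag detection is shown.

```python
from html.parser import HTMLParser


class IcmetaFinder(HTMLParser):
    """Collects the URL attribute of every ICMETA tag seen in a page."""

    def __init__(self):
        super().__init__()
        self.meta_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "icmeta":
            for name, value in attrs:
                if name == "url" and value:
                    self.meta_urls.append(value)


def find_meta_urls(html_text):
    """Return the meta-data page URLs referenced by ICMETA tags, in page order."""
    finder = IcmetaFinder()
    finder.feed(html_text)
    finder.close()
    return finder.meta_urls
```

A server following this scheme would then fetch each returned URL as a site meta-data page.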

[0119] A variation on the above scheme uses ICMETA tags on only a few “strategic” pages, for example the site's home page. Many sites place links to an index or home page on all other pages. The caching server automatically looks ahead on all pages, and if all pages have a link to a strategically-located page, the caching server quickly encounters the ICMETA tag and fetches the meta-data page.

[0120] According to an alternate embodiment, the channel developer can tag their subscription notifications with a “meta data URL” that the caching server automatically fetches before it fetches any other channel content. This last mechanism guarantees that the meta-data page will be loaded every time new channel data arrives at the caching server.

[0121] 3.2.2 Areas of Use

[0122] Meta-data may be used in the following areas to control caching server behavior:

[0123] client configuration by intranet administrators

[0124] custom expiration control

[0125] lookahead pre-fetch configuration

[0126] statistics upload configuration

[0127] Different types of site meta-data are created differently. Because meta-data is implemented in standard HTML, simple configurations can be created directly by the content provider in a standard text or HTML editor. The data can be dispensed by the content provider's back-end server at subscription notification time, or can be loaded by the caching server whenever the caching server encounters the appropriate ICMETA tag. More complex meta-data that is derived from user feedback can initially be generated automatically and then automatically updated as user feedback is gathered. The latter mechanism is described in further detail in the following sections. Further details of other meta-data applications are also described in the following sections.

[0128] 3.3 Client Configuration Meta-Data

[0129] Meta-data is used by local network administrators to configure the client software. The caching server has a built-in subscription to an HTML page containing configuration meta-data. The publisher of this subscription is the local network administrator, and the publishing host is an unqualified internet hostname. The caching server uses the host name “inCommon-Config”. The client's network software will automatically qualify this host name in its local internet domain, allowing the caching server to find a server in any intranet without having to be configured at installation time. Note that a server with this name need not be dedicated for configuration; it is just as easy to give an existing host an alternate (alias) host name of “inCommon-Config”.

[0130] This configuration mechanism is simple and powerful. It allows intranet administrators to configure their clients without any installation-time work by the user. Because the configuration data is received in the form of a subscription notification, clients will receive any configuration changes as soon as they are made. If multicast notification is used, only one copy of the new configuration is sent to all clients. The mechanism is reasonably secure because the configuration host name is well-known within the client's local internet domain, and the client initiates contact with the configuration publisher. Additional security can be implemented with Secure HTTP and digital signatures to authenticate the publishing host.

[0131] Notifications for the configuration subscription contain configuration directives in the form of HTML tags. The caching server interprets and then executes these directives. In this manner, the network administrator can control the caching server's caching behaviour (e.g. disabling it entirely on local area networks where the network speed is faster than the disk transfer rate), its proxy server, its default lookahead characteristics, even the set of subscriptions it currently uses. Any aspect of the caching server that the system designer deems useful to control can be controlled in this manner.

[0132] 3.4 Custom Expiration Control

[0133] 3.4.1 Overview

[0134] According to one embodiment, whenever the caching server is asked to retrieve content from the web, the caching server places the content in local storage while returning the content to the requestor (either the browser, a subscription, or the server itself). The server then satisfies subsequent requests for the same content from the local storage rather than going to the network. This strategy improves performance but incurs a cost: if the content changes at its origin, the caching server delivers an old copy of the data from local storage, rather than the new copy from the Internet.

[0135] The caching server solves this problem by assigning each piece of content an expiration date. The server satisfies requests for cached content from local storage until the expiration date is reached, after which time it checks at the origin site to see if the content has changed. Content providers may control the expiration behavior of cached content in a number of different ways. Flexibility in this area is crucial because if a piece of content's expiration date is set incorrectly, then the user will either see old data served out of the cache, or will see lower performance as the caching server validates or retrieves content from the network unnecessarily.

[0136] According to one embodiment of the present invention, three mechanisms are used for expiration control, in addition to the standard expiration control mechanisms offered by HTTP:

[0137] TOC-based expiration control

[0138] ICEXPIRE meta-data expiration control

[0139] automatic expiration control

[0140] The first two types of expiration control are described in this section, because they work by sharing information between a back-end server and a caching server. Automatic expiration control is performed entirely by the caching server and is described in section 4.1 Automatic Expiration Control.

[0141] 3.4.2 TOC-based Expiration Control

[0142] TOCs can be used for bulk validation and for lookahead weight configuration, as described earlier in this document. TOC-based expiration is simply a way for content providers to apply a particular expiration date to all members of a TOC without having to modify the individual members and without having to use meta-data tags.

[0143] The content provider sets the expiration date of the TOC, using any of the methods specified below, or by simply specifying an expiration in the content itself using standard HTML META tags or HTTP headers. All members of the TOC immediately get that TOC's expiration date.

[0144] TOC-based expiration gives content providers fine control over content expiration (in addition to its primary benefits of bulk validation and custom lookahead weighting). TOC-based expiration allows content providers to group URLs with many different syntaxes but similar expiration behavior into a single construct (the TOC) and expire them together, where otherwise a large number of ICEXPIRE meta-data tags would need to be used. Again, the goal is flexibility for the content provider.

[0145] 3.4.3 ICEXPIRE Overview

[0146] Expiration meta-data allows content providers to use HTML to describe the expiration control behavior they desire for the caching server when serving their content. Content providers may bind regular expressions to various expiration control parameters. Any URL that matches the regular expression is automatically given an expiration date according to the parameters associated with the regular expression.

[0147] Expiration meta-data is defined and maintained by the back-end server and can be sent down to a caching server as part of a channel subscription's notification data, or loaded by the caching server as it encounters ICMETA tags in published content. The ICEXPIRE HTML tag is used to implement this functionality. The tag has four attributes, two of which control lookup behavior and the remaining two of which define the expiration date.

[0148] Regular expression processing is traditionally slow. Given that caching server performance is extremely important, the ICEXPIRE tag provides a high-speed level of lookup before regular expression matching is performed. The HOST attribute defines a host name to which the expiration applies. Only those URLs with a matching host name are considered for regular expression matching. The host names can be used as keys in a hash table, providing a first level of high-speed lookup. Once the correct host is found, the server can travel through the set of ICEXPIRE regular expressions that apply to that host, until a match is found. Each regular expression is specified with the REGEXP attribute. Once a match is found, the expiration control attributes in the tag are applied to the matching URL, as described in the following sections. The remaining two attributes describe a fixed expiration and a minimum expiration. The uses of these attributes are described in the following sections.
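The two-level lookup described above (hash on HOST first, then a scan of that host's REGEXP patterns) might look roughly like the sketch below. The `ExpireRules` class is a hypothetical name, and the choice to require that the pattern match the whole URL is an assumption; the text only says that the URL "matches" the expression.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


class ExpireRules:
    """Hypothetical store for ICEXPIRE rules, keyed by host name."""

    def __init__(self):
        self._by_host = defaultdict(list)  # host -> [(compiled regex, params)]

    def add(self, host, regexp, **params):
        self._by_host[host.lower()].append((re.compile(regexp), params))

    def lookup(self, url):
        host = (urlsplit(url).hostname or "").lower()
        rules = self._by_host.get(host)
        if not rules:
            # Stage 1 miss: no regular expression is evaluated at all.
            return None
        for pattern, params in rules:
            # Stage 2: the first matching rule for this host wins.
            if pattern.fullmatch(url):
                return params
        return None
```

URLs on hosts with no ICEXPIRE rules cost one dictionary probe, which is the point of the first stage.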

[0149] 3.4.4 Fixed Expirations

[0150] The most typical use of ICEXPIRE causes all URLs matching the regular expression to be assigned a specific expiration date:

<ICEXPIRE HOST=www.cnn.com REGEXP=“http://www.cnn.com/sports/.*” EXPIRATION=“Thu, 30 Jan. 1997 11:12:13 GMT”>

[0151] In the above example, all URLs on www.cnn.com that are under the sports section are set so they expire on January 30 at the specified time. The expiration date can also be specified in seconds relative to the current time. In the following example, all matching content gets an expiration date 600 seconds from the time it was retrieved:

<ICEXPIRE HOST=www.cnn.com REGEXP=“http://www.cnn.com/sports/.*” EXPIRATION=“+600”>

[0152] 3.4.5 Conservative Expirations

[0153] Some sites may not want any of their content cached at all. These are sites whose content changes rapidly or unpredictably, or whose content is extremely short-lived. An ICEXPIRE whose EXPIRATION attribute is set to the special value ALWAYS tells the server that all content satisfying the tag's match criteria is always to be verified from the network, and satisfied from the cache only if the content has not changed. The result is poorer performance but greater accuracy.

<ICEXPIRE HOST=www.cnn.com REGEXP=“http://www.cnn.com/sports/.*” EXPIRATION=ALWAYS>

[0154] Content providers are thus able to control the behavior of the content without modifying the content itself. The expiration meta-data can be applied at an arbitrarily fine granularity by tuning the ICEXPIRE regular expression appropriately.

[0155] 3.4.6 Liberal Expirations

[0156] At the other end of the spectrum are sites that never want their content to expire. These types of sites contain content that appears once (possibly for a very short time) and never changes. Examples are newspaper or journal articles with URL identifiers that are never re-used. For this type of content, a special value NEVER is provided for an ICEXPIRE EXPIRATION attribute. The value works exactly as above, except that content matching the tag never expires.

<ICEXPIRE HOST=www.cnn.com REGEXP=“http://www.cnn.com/sports/.*” EXPIRATION=NEVER>

[0157] 3.4.7 Minimum Expirations

[0158] Expiration dates computed with the automatic expiration algorithm described in Section 4.1 Automatic Expiration Control can sometimes yield non-intuitive results. If, for example, an object is retrieved immediately after it is modified, and the object does not have enough lifetime samples, the resulting new expiration date will be very short, causing the object to expire earlier than it should.

[0159] Content providers can override an expiration date calculated by the algorithm with a minimum value. If the expiration date computed by the algorithm is below the minimum, the minimum is used. The minimum expiration is defined in the MIN_EXPIRATION attribute of the ICEXPIRE HTML tag. It can be specified as a time in seconds relative to the time the content was retrieved, or as an absolute date in standard HTTP format.

<ICEXPIRE HOST=www.cnn.com REGEXP=“http://www.cnn.com/sports/.*” MIN_EXPIRATION=“+600”>

[0160] In this example, all matching content has its expiration calculated using the automatic expiration algorithm, but if that result is less than 600 seconds from now, the expiration is set to 600 seconds from now.

<ICEXPIRE HOST=www.cnn.com REGEXP=“http://www.cnn.com/sports/.*” MIN_EXPIRATION=“Sun, 2 Feb. 1997 10:10:10 GMT”>

[0161] In the second example, the behavior is identical, but the minimum expiration is a specific time in the future.
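The relative and absolute forms of EXPIRATION and MIN_EXPIRATION, and the minimum override, can be sketched as below. The function names are hypothetical, the automatic expiration is taken as an input because its algorithm is described elsewhere, and the sketch assumes timezone-aware timestamps and dates written in the usual HTTP style without a period after the month abbreviation.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_expiration(value, retrieved_at):
    """Turn an EXPIRATION or MIN_EXPIRATION value into an absolute time.

    "+600" means 600 seconds after retrieved_at; anything else is parsed
    as an HTTP-style date such as "Thu, 30 Jan 1997 11:12:13 GMT".
    """
    if value.startswith("+"):
        return retrieved_at + timedelta(seconds=int(value[1:]))
    return parsedate_to_datetime(value)


def apply_minimum(automatic, min_value, retrieved_at):
    """Use the automatic expiration unless it falls below the minimum."""
    floor = parse_expiration(min_value, retrieved_at)
    return max(automatic, floor)
```

With MIN_EXPIRATION="+600", an automatic result only 30 seconds after retrieval is raised to 600 seconds, while a result of 900 seconds is left unchanged.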

[0162] 3.5 Lookahead

[0163] 3.5.1 Overview

[0164] One of the major areas where the caching server adds value is in lookahead (pre-fetch of content). Most caching is based on past usage of the network: the user visits a site and their web browser stores that site's content for a set period of time. If the site is re-visited, the content is fetched locally, rather than over the network.

[0165] According to one embodiment, lookahead caching uses predictive algorithms to determine where a user may go given their current location. Lookahead caching then attempts to fetch the desired content before the user actually travels to the new location. Thus, when the user actually travels along a web link to a new page, that page is already present locally and can be displayed very quickly.

[0166] A number of different mechanisms allow content providers to tune the lookahead algorithm with a great degree of flexibility for their particular site layout. According to one embodiment, the lookahead algorithm runs on the caching server. The lookahead algorithm uses tuning information created by content providers and maintained by a publisher's back-end server. Statistics gathering mechanisms are integrated with lookahead tuning, creating a feedback cycle where usage of a site causes statistic uploads to the back-end server, which then automatically aggregates and updates the tuning information and downloads the result to all subscriber caching servers. FIG. 5 is a flow chart illustrating an overview of statistics gathering according to one embodiment. FIG. 6 is a flow chart illustrating an overview of one embodiment of lookahead caching.

[0167] 3.5.2 Terminology

[0168] Following are some terms used in the next sections:

Initial page: Whenever the user requests a page from their web browser, that page is looked ahead upon. That user-requested page is known as the “initial page”.

Child page: A page reachable via URL from a “parent page”. The lookahead algorithm works by analyzing child links of the initial page, and then recursing on the pages pointed to by each child link.

Parent page: An HTML page whose children are analyzed by the lookahead algorithm.

Lookahead level: Also known as “lookahead depth”. The number of links between the initial page and the current page. A lookahead level of 1 includes all child pages of the initial page, together with their inline images and applets. A lookahead level of 2 includes the child pages of each child page of the initial page, together with their inline images. Levels 3 and 4 continue in the same way. Levels of 3 and above include an enormous number of pages.

Positive filter: A regular expression applied to child link URLs. If the URL matches the expression, the lookahead algorithm continues at that point, otherwise it stops.

Negative filter: As above, but lookahead continues only if the URL does not match the regular expression.

Weight: An arbitrary number assigned to a child link, representing the probability relative to its siblings that it will be traversed by the user.

Score: The importance of one lookahead request relative to all other requests currently queued. This includes requests made from other browser windows, or by other components of the system.

[0169] 3.5.3 Algorithm Overview

[0170] According to one embodiment, the caching server's lookahead algorithm starts by assigning a score to an initial page. When the page arrives, the caching server scans the page for child links to other pages. The caching server then assigns each of these child pages a “weight,” or likelihood that the child pages may be traveled along by the user. The weights are numbers that represent the likelihood relative to other pages on the site that that page may be accessed. The weight is the content provider's opinion of the page's access likelihood relative to other pages. According to one embodiment, the user's actual browsing behavior is not taken into account in determining the weight.

[0171] According to an alternate embodiment, the user's browsing behavior is also examined. Both weight and browsing behavior are then fed into an algorithm, together with the parent page's score. That results in a child page score. The caching server then queues requests for all the child pages in descending score order and begins fetching them. The algorithm has the property that the child link scores, when added together, result in the parent page's score. The scoring algorithm is described in detail in the next section.

[0172] According to one embodiment, lookahead processing is recursive. As soon as a lookahead request completes, the retrieved page is analyzed for links and the same scoring algorithm is executed over those links to yield a set of probabilities that they will be traveled along if the parent page is reached. Because a given page's link scores all sum to the parent's score, this embodiment of the algorithm has the desirable behavior that all lookahead requests generated from a given page have scores that preserve the likelihood relative to one another that the page will be selected by the user. The algorithm also has the desirable behavior that it converges automatically. Each page's score gets smaller the farther it is from the original page. Eventually it reaches zero, at a rate inversely proportional to its weight and the weights of its parents. The higher a link's weight, the larger the percentage of its parent's score it will receive and propagate to its children.

[0173] The algorithm is halted as soon as the user requests a new page. The algorithm is then started from the new initial page. Any existing lookahead requests remain queued, but their scores are set to zero. If they appear in the new run of the lookahead algorithm, then their scores will be assigned as appropriate, relative to the user's new position. Once the new run completes, any leftover requests with zero scores are eliminated, since it becomes highly unlikely based on the algorithm that the user will want those pages. This behavior has the desirable property that the number of queued lookahead requests remains bounded and doesn't consume too much caching server resource.
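The restart behavior described above can be pictured with a small sketch. It assumes, purely for illustration, that the queue is a mapping from URL to score and that the new run has already produced its scores; `restart_lookahead` is a hypothetical name.

```python
def restart_lookahead(queued, new_scores):
    """Rescore the lookahead queue after the user requests a new page.

    queued:     dict of URL -> score left over from the previous run
    new_scores: dict of URL -> score computed from the new initial page
    """
    # Existing requests stay queued, but their scores are reset to zero.
    rescored = {url: 0.0 for url in queued}
    # Requests that appear in the new run receive their new scores.
    rescored.update(new_scores)
    # Once the new run completes, leftovers still at zero are eliminated.
    return {url: score for url, score in rescored.items() if score > 0}
```

Because every request either gets a fresh score or is dropped, the queue never grows beyond what the current run can reach.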

[0174] Many instances of the algorithm can run in parallel. The caching server can, for example, run an instance of the algorithm for each browser window that the user has active. If the caching server runs as a network service used by multiple users each with their own browsers, then the caching server can run an instance of the algorithm for each browser. Furthermore, the caching server can run an instance of the lookahead algorithm for each subscription (see following) that the user has created.

[0175] Each algorithm instance runs in parallel with the other instances, and can have its own configuration. The scores resulting from each instance's analysis are all absolute, meaning that the score assigned to the initial request is the only way for one instance's resulting lookahead requests to be more important than any other instance's requests.

[0176] 3.5.4 Child Scoring Algorithm

[0177] According to one embodiment, a child link's score has three inputs: the user's past browsing behavior on that link's page, the content provider's weight for that child page relative to the other pages on the site, and the parent's score. Inline images, applets, and certain other objects are always assigned their parent's weight. Objects of these types are always automatically requested by the browser whenever the browser sees their links, so by definition they are as likely to be accessed as their parent, hence getting their parent's weight.

[0178] HTML pages are more difficult to score. The algorithm must pay most of its attention to the content-provider-assigned page weight until the user's browsing behavior becomes statistically valid. In the absence of any user input, the content provider clearly has an idea of how their content is used, particularly when their weights are determined via a statistic gathering mechanism (described later). The algorithm assigns the right proportion of user behavior and content-provider-assigned weight to the score so that the more the user visits the site, the more his or her input is valued.

[0179] The algorithm also weighs user behavior such that a user need only visit a small site (site with a small number of pages) a small number of times to get a statistically valid surfing sample. A user must correspondingly visit a large site (site with a large number of pages) a large number of times to get the same statistically valid sample. For the lookahead algorithm to correctly represent probabilities, all child page scores on a parent page must sum to the parent page's score.

[0180] According to one embodiment, the algorithm is logarithmic to allow the user's behavior to overwhelm the content provider's weighting as soon as statistically possible, then level off as the user visits the site more and more, so that there is always a minimum effect from the content-provider's weights. A piece-wise linear approximation may also be utilized.

[0181] According to one embodiment of the present invention, the algorithm is as follows:

p = parentScore * [ (M1(n, nHits) * (nHits / totalHits)) + (M2(n, nHits) * (weight / totalWeight)) ]

if hits <= 3n, then
  M1(n, nHits) is: hits / 4n
  M2(n, nHits) is: (4n − hits) / 4n

if hits > 3n, then
  M1(n, nHits) is: hits / (hits + n)
  M2(n, nHits) is: n / (hits + n)

n = max(1, totalLinks / 4)

[0182] nHits is the number of times the child page has been accessed by the user. TotalHits is the total number of times all pages accessible from the parent page have been accessed by the user. Weight is the child page's content-provider-assigned page weight. TotalWeight is the total of the weights of all pages accessible from the parent page. N represents the size of the site and is described in detail below. Functions M1 and M2 are both scaled by percentages, M1 by the percentage of all hits on the parent page's children that went to this child, and M2 by the percentage that this child's weight is of the total child weights on this page. Each of the two scaled terms of each child page will therefore sum to the parent's total, giving the desired behavior that the sum of all child scores is the parent score.

[0183] A piece-wise linear approximation of logarithmic behavior is obtained by dividing the scoring algorithm into two areas, one used when the number of hits is at most 3N and one when the number of hits is greater than 3N. The “knee” in the approximated curve occurs when the number of hits is exactly 3N. Variable N is used instead of a constant number of hits to parameterise the knee in the curve by the size of the site. The user is not granted a statistically valid browsing sample until the user has accessed a certain percentage of a site. Otherwise the user's browsing activity would be weighted too heavily too quickly on a small site for adequate behavior in a large site, or too heavily too slowly on a large site for adequate behavior for a small site.

[0184] The two functions M1 and M2 are complementary. M1 (the user-behavior term) gets larger and M2 (the content-provider term) gets smaller the more the user visits the site. The M1 value increases and the M2 value decreases rapidly in the first part of the algorithm, where the number of hits is fewer than 75% of the total number of links (3N = 3 * (number of links / 4)). In a site with 28 links, for example, after 7 hits, the user behavior is weighted 25% and the content-provider 75%. After 14 hits, the relative percentages are 50/50, and after 21 hits, 75/25.

[0185] After this point the algorithm drops into a slower mode, where the percentage of user-derived behavior increases slowly relative to the content-provider-derived behavior. After 28 hits in the above example, the relative percentages are 80/20. The number of hits has increased by a third (from 21 to 28), but the percentage of user-derived behavior has risen only 5 points. Thus, in the beginning, almost all of the child's score comes from its weight over the weight total. As the number of hits increases, more and more of the score comes from the number of hits, until finally it levels off at a contribution of 75% to the total score.
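The scoring formula and the 28-link example can be checked with a short sketch. The function names `mix` and `child_score` are hypothetical. The formula writes M1(n, nHits) but uses a bare "hits" in its body; this sketch assumes "hits" is the total number of hits across the parent page's children (totalHits), since that reading makes M1 + M2 = 1 for every child and lets the child scores sum to the parent score as the text states. It also assumes a hit share of zero when no hits have been recorded.

```python
def mix(n, hits):
    """Return (M1, M2): the user-behavior and content-provider fractions."""
    if hits <= 3 * n:
        # Steep linear segment, up to the knee at hits == 3n.
        return hits / (4 * n), (4 * n - hits) / (4 * n)
    # Slower segment beyond the knee.
    return hits / (hits + n), n / (hits + n)


def child_score(parent_score, n_hits, total_hits, weight, total_weight, total_links):
    """Score one child link from its hit share and its weight share."""
    n = max(1, total_links / 4)
    m1, m2 = mix(n, total_hits)  # assumption: "hits" means total_hits
    hit_share = n_hits / total_hits if total_hits else 0.0
    return parent_score * (m1 * hit_share + m2 * weight / total_weight)
```

For a 28-link site (n = 7), `mix(7, 7)`, `mix(7, 14)`, `mix(7, 21)` and `mix(7, 28)` give the 25/75, 50/50, 75/25 and 80/20 splits quoted in the example.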

[0186] 3.5.5 Default Weight Assignment

[0187] Proper assignment of link weights is crucial to correct lookahead behavior. The caching server can automatically assign weights to each link. Given that the caching server knows nothing about the site's semantics, however, there is no way for the caching server to do more than make basic assumptions about link placement and use those assumptions to assign weights. By comparing the link placement to a number of different stored profiles, the caching server can apply one of a suite of link assignment algorithms to the page.

[0188] Alternatively, according to one embodiment of the present invention, weights can be assigned in exponentially decreasing magnitude to links as the links are encountered on the page. This embodiment is based on the assumption that links near the start of the page are more likely to be traversed than links farther down the page. The algorithm divides links into two categories: links on the same site as the parent page, and links on a different site. The algorithm gives higher weights to links in the same site. The algorithm looks for links in the order encountered in the parent page's HTML stream. The first N links are assigned the maximum weight W. After each N links, the current maximum weight W is lowered by a scaling factor S. This process continues until W reaches a minimum value. According to one embodiment, W is 256, N is 4, and S is 0.5. Thus the first four links encountered are assigned a weight of 256, the next four get 128, the next four get 64, and so on down to 1. Other values of W, N and S may also be utilized.

[0189] Links outside the parent page's site are assigned a reduced weight, under the assumption that the user is less likely to stray from the site than stay in the site. Each off-site link is assigned a weight which is the current maximum weight W reduced by a reduction factor R. The off-site links do not count toward N, i.e. the off-site links never cause W to be lowered. According to one embodiment, R is 0.25. Thus every off-site link encountered during the first 4 on-site links gets a weight of 64, then 32 if encountered during the next 4 on-site links, and so on down to 1.
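The default weighting scheme, with the example values W = 256, N = 4, S = 0.5 and R = 0.25, can be sketched as follows. The function name and the boolean input format are illustrative assumptions, and the floor of 1 is taken from the "down to 1" wording.

```python
def default_weights(links_offsite, w=256, n=4, s=0.5, r=0.25, w_min=1):
    """Assign default weights to links in page order.

    links_offsite: iterable of booleans, True where the link leaves the site.
    """
    weights = []
    current = w          # current maximum weight W
    onsite_seen = 0      # only on-site links count toward N
    for offsite in links_offsite:
        if offsite:
            # Off-site links get W reduced by R and never lower W.
            weights.append(max(w_min, current * r))
        else:
            weights.append(current)
            onsite_seen += 1
            if onsite_seen % n == 0:
                # After each N on-site links, scale W down by S.
                current = max(w_min, current * s)
    return weights
```

Nine consecutive on-site links receive 256, 256, 256, 256, 128, 128, 128, 128 and 64, matching the progression in the text.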

[0190] 3.5.6 Custom Weight Assignment

[0191] According to one embodiment, in situations where the default weighting algorithm is inappropriate, the system allows the content provider to specify weights for each page on the site. The weights are stored in a TOC, described in Section 3.1, Bulk Validation. Each URL defined by a TOC ICPAGE tag can have its own custom weight, in addition to the standard bulk-validation information.

[0192] Each ICPAGE tag in a TOC contains, in addition to the member URL name and its validation information, a content-provider-supplied weight:

<ICPAGE URL=http://truth.incommon.com/library/achannels.html WEIGHT=1234 LASTMOD=84762964>

[0193] That weight is then used in place of any weight calculated by the caching server's default weight assignment algorithm. The TOC can initially be generated automatically by the back-end server, with default weights assigned via the algorithm described in the previous section. As caching servers run on many client machines, they upload their usage statistics to a central collection point. Each caching server uploads a version of its TOC periodically, the exact frequency being defined by the content provider as part of the TOC data. The uploaded information contains each TOC member URL and the number of times the uploading caching server has accessed the particular member.

[0194] The result of the uploads is an exact picture of the site's usage patterns, on a per-URL basis, automatically grouped by TOC. In addition to being valuable site organization data, the hit counts can be aggregated manually, automatically, or a combination of both, and fed back into a new version of the TOC, which in turn is downloaded to the caching servers as their copies of the TOC expire. This feedback cycle automatically tunes the lookahead algorithm on a per-TOC basis, exactly in accordance with actual usage patterns.

[0195] 3.5.7 Other Lookahead Configuration

[0196] While weight calculation and page scoring are the foundation of successful site lookahead, the lookahead algorithm can also be configured along several other axes. Moreover, the algorithm can be configured separately for different sites, even for different subscriptions in the same site. Each configuration is described by a piece of meta-data called an ICLOOKAHEAD tag. That tag can be placed anywhere, and typically appears in one of two places. According to one embodiment of the present invention, the tag can be placed in a site meta-data page along with other meta-data, such as pointers to TOC information, or regular-expression-based ICEXPIRE expiration tags. According to another embodiment, the tag can be transmitted as MIME data in channel subscription notification data. In both cases, the configuration itself is identical but retrieved differently.

[0197] The lookahead configuration tag controls the following algorithm parameters in addition to the custom expiration data described earlier in this document:

[0198] maximum depth

[0199] maximum number of links

[0200] pruning regular expressions

[0201] pruning file types

[0202] lookahead off site or not

[0203] lookahead out of TOC or not

[0204] lookahead on images, audio, or video

[0205] lookahead for a maximum amount of time

[0206] lookahead to a maximum amount of data

[0207] The configuration scheme is HTML and is therefore easily extensible. Following is an example of the ICLOOKAHEAD tag and its attributes: <ICLOOKAHEAD DOMAIN_NAME=cnn.com HOST_REGEXP=.*.cnn.com MAX_DEPTH=2 MAX_LINKS=50 NO_GRAPHICS=FALSE NO_OFFSITE=TRUE PROCEED_IF_MATCH=“.*/topstories/*.html”>

[0208] In addition to lookahead configurations that are bound to channel subscriptions, the content provider can have any number of lookahead configurations bound to site (host) name regular expressions. According to one embodiment, in order to improve performance, the caching server uses a two-stage lookup mechanism similar to that used by ICEXPIRE tags. In this case the first stage is the host's “domain”, i.e. the last two labels of the host name. The domain is stored in a hash table and can be looked up quickly. Whenever a page is looked ahead on, its URL's host name's domain is looked up in the hash table. If an entry is found, all lookahead configurations for that domain have their host name regular expressions compared against the URL's host name. The configuration whose host name regular expression first matches the URL's host name is used to configure lookahead for that URL. The two-stage lookup algorithm thus ensures that domains with no custom lookahead are not slowed by domains with lots of custom lookahead.
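The two-stage lookup described above might be sketched as follows. This is a minimal illustration, not the caching server's implementation: the class and method names are hypothetical, Python's `re` syntax stands in for whatever regular expression language is used, and a whole-string match against the host name is assumed.

```python
import re
from collections import defaultdict


def domain_of(host):
    """Return the last two labels of a host name ('www.cnn.com' -> 'cnn.com')."""
    return ".".join(host.lower().split(".")[-2:])


class LookaheadConfigTable:
    """Hypothetical table of lookahead configurations keyed by domain."""

    def __init__(self):
        # Stage-one hash table: domain -> ordered list of (host regexp, config).
        self._by_domain = defaultdict(list)

    def add(self, domain, host_regexp, config):
        self._by_domain[domain.lower()].append((re.compile(host_regexp), config))

    def lookup(self, host):
        # Stage one: constant-time hash lookup on the domain.
        entries = self._by_domain.get(domain_of(host))
        if not entries:
            return None  # no custom lookahead, so no regexps are evaluated
        # Stage two: the first host-name regexp that matches wins.
        for pattern, config in entries:
            if pattern.fullmatch(host.lower()):
                return config
        return None
```

Hosts in domains with no registered configuration return after the hash lookup alone, which is the performance property the two-stage scheme is intended to provide.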

[0209] 3.5.8 Depth Configuration

[0210] The maximum depth parameter controls how many levels of links are chased by the lookahead algorithm. Level one lookahead consists of the current page's links and all of their inline images. Level two lookahead consists of level one lookahead plus each child page's links and all of their inline images. Levels three and four extend the algorithm further. Traditional “spidering” algorithms chase all links to a specific level. The result is an explosion of requests at levels above two, rendering the spidering almost useless unless the user has high-speed network access and lots of time on their hands.

[0211] One embodiment of the present invention uses a combination of level and lookahead score to regulate lookahead and keep it effective. Any potential lookahead request below level one must have a score above a minimum cutoff in order to be considered. The cutoff score is 1/50 of the original page's score. Thus as the lookahead algorithm gets farther and farther removed from the original page, the scores it derives get smaller and smaller, until they fall below the cutoff.

[0212] Scores do not drop uniformly but rather according to the relative importance of the various links traversed. If one link on a page has 75% of the total score on that page, the other links at its level may all have scores too low to allow lookahead to continue through them. The page with the large score may, however, have a score large enough for it and its children to be looked ahead on. This depends on the page scores, which in turn depend on user behavior and content-provider-driven behavior. The merging of cutoff score, the scoring algorithm, and level control gives the content provider exact control over how lookahead is performed. The result is far more accurate than traditional spidering.
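The interaction of depth limit and cutoff score might be sketched as follows. The sketch assumes a simplified scoring model in which each link receives a fixed share of its parent page's score; that model, the function name, and the data layout are assumptions for illustration and do not reproduce the scoring algorithm described earlier.

```python
from collections import deque


def plan_lookahead(links, root, root_score, max_depth, cutoff_fraction=1 / 50):
    """Return (page, level, score) tuples for pages selected for lookahead.

    links maps a page to a list of (child, share) pairs, where share is the
    assumed fraction of the parent's score carried by that link.
    """
    cutoff = root_score * cutoff_fraction
    planned = []
    seen = {root}
    queue = deque([(root, root_score, 0)])
    while queue:
        page, score, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not chase links past the configured depth
        for child, share in links.get(page, []):
            if child in seen:
                continue
            child_score = score * share
            child_depth = depth + 1
            # Level-one links are always candidates; deeper links must
            # score above the cutoff to be considered.
            if child_depth > 1 and child_score <= cutoff:
                continue
            seen.add(child)
            planned.append((child, child_depth, child_score))
            queue.append((child, child_score, child_depth))
    return planned
```

With an original page score of 100, the cutoff is 2, so a level-two link that inherits a score of 1.25 is pruned while one that inherits 37.5 is kept.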

[0213] 3.5.9 Link Count Configuration

[0214] Link counts are the second parameter to lookahead configuration. The content provider can control the maximum number of links looked ahead on the initial page. The link count applies only to the initial page's links. Their child links are looked ahead according to cutoff score, as detailed in the algorithm description previously.

[0215] Link count creates flexibility for content providers. Normally, all links off the initial page are looked ahead on. Lookahead score is used only to prioritize the requests. If the initial page has a large number of links, however, the algorithm may spend too much time looking ahead on links unlikely to be traversed by the user. By restricting the link count to a smaller number, the lookahead algorithm can complete more quickly, and spend its time analyzing subsequent pages.

[0216] 3.5.10 Lookahead on Images, Audio, Video

[0217] According to one embodiment, the lookahead algorithm can be further tuned by being configured not to look ahead on images, audio content, or video content. These types of content are typically much larger than HTML pages, and therefore take longer to download. In the time taken to download a single JPEG image, for example, the server could download ten or fifteen HTML pages. In the time taken to download a single WAV file (audio file), tens of HTML pages could be loaded. The savings are even greater for video content.

[0218] Users generally want images. Content providers also generally want images downloaded, particularly if the images are advertisements. Occasionally, however, site images are not important relative to the text on the page. In those cases, the content provider may want to disable lookahead on the images and download the text content much more rapidly. Audio and video tend to be much less important to the look and feel of a page, and may therefore typically be disabled. In addition, image, audio, and video lookahead can be tuned so that rather than either happening at parent priority or not at all, they happen below parent priority, possibly after all text lookahead has completed.

[0219] 3.5.11 Pruning Regular Expressions

[0220] The content provider can create two regular expressions in each lookahead configuration, as illustrated in FIGS. 7A-B. One expression is used as a “positive filter” for links on a page. For a page link to be considered for lookahead, its URL must match the positive-filter regular expression in FIG. 7A. The regular expression syntax used is not important, as long as it has the functionality to match a reasonable variety of URLs without too much work by the content provider. In the following examples, the regular expression language is that used by the GNU Emacs text editor. Examples of other equally usable languages are those used in the Posix Unix standard, Microsoft Developer Studio, the Unix “egrep” program, or the Epsilon text editor.

[0221] The second regular expression is used as a “negative filter” for links on a page, as illustrated in FIG. 7B. For a page link to be considered for lookahead, its URL must not match the negative-filter regular expression. This type of regular expression is useful for screening out certain types of links that the content provider wants never to be looked ahead on. Typical candidates are executable files, full-motion video, or sound files.
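The combined effect of the two filters might be sketched as follows. Python's `re` syntax is used here purely for illustration in place of the GNU Emacs syntax mentioned above, the function name is hypothetical, and matching anywhere in the URL (rather than against the whole URL) is an assumption.

```python
import re


def passes_filters(url, positive=None, negative=None):
    """Decide whether a link is a lookahead candidate.

    positive and negative are optional regular expression strings; a filter
    set to None is treated as absent.
    """
    # Positive filter: the URL must match, if the filter is configured.
    if positive is not None and re.search(positive, url) is None:
        return False
    # Negative filter: the URL must not match, if the filter is configured.
    if negative is not None and re.search(negative, url) is not None:
        return False
    return True
```

For example, a negative filter such as `\.(exe|wav|avi)$` would screen out executables and media files regardless of the positive filter.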

[0222] 3.5.12 Lookahead Off Site

[0223] This lookahead tuning parameter is a Boolean value that controls whether lookahead is performed on “off-site” links. Off-site links are defined to be those links whose URL host component is different from the host component of their parent page. Content providers may set this value to “no lookahead off-site” if they do not wish the caching server to expend resources such as network access time, processing time, or disk storage looking ahead for pages not owned by the content provider.
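The off-site test defined above might be sketched as follows. The function name is hypothetical, and resolving relative links against the parent URL before comparing hosts is an assumption made so that relative links count as on-site.

```python
from urllib.parse import urljoin, urlsplit


def is_off_site(parent_url, link_url):
    """Return True if the link's host component differs from the parent's."""
    # Resolve relative links against the parent page first.
    resolved = urljoin(parent_url, link_url)
    # urlsplit(...).hostname is lowercased, so the comparison ignores case.
    return urlsplit(parent_url).hostname != urlsplit(resolved).hostname
```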

[0224] 3.5.13 Default Lookahead Configurations

[0225] Just as the caching server has a default link weight computation algorithm for use in cases where the content provider does not provide explicit weights in their TOC meta-data, the caching server also has default lookahead configurations for cases where the content provider has no applicable ICLOOKAHEAD information in their site meta-data, or no lookahead configuration is provided with a subscription channel notification. According to one embodiment, the caching server uses two similar default lookahead configurations, one for browser-based lookahead and one for channel-subscription-based lookahead. Additional configurations can easily be stored, and the existing configurations may also be modified by the end user.

[0226] The default configuration currently used by the caching server for browser-based lookahead has the following settings:

[0227] Maximum depth 1 (initial page's children and their inline images)

[0228] Maximum links 50 (the first 50 links encountered on the initial page)

[0229] Lookahead off-site

[0230] No positive-filter regular expression

[0231] Lookahead on images, but neither audio nor video

[0232] Negative-filter regular expression to remove executable files, server-side image maps, and binary data files

[0233] 3.6 Subscriptions and Notification

[0234] Once a user subscribes to content from a publisher, that content must be delivered to the user's desktop. This section describes the process by which a user subscribes to content, and it describes the system that pushes notification data from the publisher to the user's desktop.

[0235] Creating a subscription is intended to be a simple, lightweight process, for example a single click of a button on a web page by the user. Similarly, deleting a subscription is a simple operation performed on the content bar by the user. Subscribing to content begins the flow of notifications from the content provider to the user. Unsubscribing terminates the flow of notifications, guaranteeing that the user sees no more information from that subscription.

[0236] The subscription process is also intended to be highly configurable. There are a number of different notification mechanisms available to the publisher, and each is appropriate in different situations. The notification system allows the content provider and the client to negotiate notification mechanisms, and further allows restrictions by intranet administrators, so that clients on private networks operate under rules defined by the network administrator no matter what the information publisher wants.

[0237] 3.6.1 Service Requirements

[0238] The following describes some of the different areas that a notification service must address:

[0239] shared data

[0240] personalized data

[0241] reliable delivery

[0242] confirmed delivery

[0243] store-and-forward delivery

[0244] internets and firewalls

[0245] security

[0246] anonymity

[0247] user control over subscriptions

[0248] 3.6.1.1 Shared Data

[0249] According to one embodiment, the notification system must be able to handle effectively data shared by a large user community. Given that the data is shared, notifying subscribers of its presence is most effectively performed by a multicast protocol. Multicast protocols save network bandwidth, improve origin server performance by sending only a single copy of the data, and keep the origin server from having to maintain subscriber lists (although such lists may be maintained for other reasons).

[0250] 3.6.1.2 Personalized Data

[0251] At the other end of the spectrum is highly personalized data, such as stock portfolio updates and personalized newspapers. The network overhead of maintaining multicast groups in this instance is wasted because there is only ever one recipient of the data. Instead, the system must be able to unicast notification data, or at least an indicator that such data is available.

[0252] 3.6.1.3 Reliable Delivery

[0253] Publishers originating notification data need to know that their subscribers will receive the data. “Reliable” in this context is fairly basic, on the order of email-based reliability. The system must guarantee that the data arrive at all subscribers “eventually”, with the publisher having at least some control over the maximum time beyond which it knows that 99% of the subscribers have received the data.

[0254] 3.6.1.4 Confirmed Delivery

[0255] Confirmed delivery takes reliable delivery one step farther. The publisher not only needs to know that its data will eventually be received by all subscribers, it also needs to know which subscribers received the data and at what time. Such a system requires subscriber lists, with individual subscribers contacting the publisher on receipt. This type of return-receipt-request may have an impact on the performance of the system.

[0256] 3.6.1.5 Store and Forward Delivery

[0257] A class of subscriber machine that is not always connected to the network includes laptops that dock with a networked station periodically, or a machine that dials into the network periodically to pick up information. Another class of machines uses DHCP (Dynamic Host Configuration Protocol) for dynamic IP address management. Such machines may overlap with the previous class of machines, but also include desktop machines permanently connected to the network but whose addresses are managed dynamically.

[0258] Both classes of machines can benefit from using a proxy notification server that is responsible for handling incoming notifications and buffering them for the user. If such a service is not available, frequently-disconnected machines will be forced to poll, since they cannot count on receiving the notifications. DHCP machines are forced to poll if they operate in a unicast-only environment, because unicast requires address lists, and it is not possible to maintain address lists effectively if the addresses change constantly. Using host names for address transparency does not work either, because many of the machines do not have names.

[0259] 3.6.1.6 Internets and Firewalls

[0260] The notification system must be general enough to perform well in an intranet context as well as in an Internet context. One obvious problem with use on the Internet is that the publishers and their subscribers will frequently be on opposite sides of a firewall. Firewalls are frequently configured to let requests out into the Internet, but to bar unsolicited information other than email from travelling into the intranet.

[0261] The notification system needs to function reasonably well in a firewall environment that behaves in this manner. The notification system also needs to offer notification functionality that is simple enough that network administrators can scope any security issues easily. The fewer the security concerns, the more likely notifications may be allowed through a firewall by network administrators who believe the benefits of asynchronous notifications in terms of network bandwidth savings make it worthwhile to reconfigure their firewall software.

[0262] 3.6.1.7 Security

[0263] The notification system uses existing security infrastructure to give subscribers assurance that incoming notifications are indeed from the desired publisher, and not from a malicious third party. In addition, notification data is encryptable.

[0264] 3.6.1.8 Anonymity

[0265] Subscribers may wish to remain anonymous from publishers. The notification system must be able to provide a level of indirection between publishers and subscribers that implements anonymity. A multicast notification system by itself does not guarantee anonymity. Instead, the system needs to use proxy notification servers that act on behalf of clients wishing to remain anonymous.

[0266] 3.6.1.9 User Control over Subscriptions

[0267] Once a user creates a subscription, he or she must be able to remove that subscription and know that the system will immediately stop accepting notifications for that subscription. This gives the user fine-grain control over the types of information he or she receives and does not allow the provider undue privilege. This is a key solution to a major problem with electronic mail: unsolicited email. Email, as a notification mechanism, operates at too gross an addressing level. By giving one's electronic mail address to a publisher, the user loses the ability to screen out future unwanted information from that publisher, and has no control if that publisher passes the email address to someone else. The key to solving this problem is the scheme of registration and subscription where the user retains control of whether to accept or reject information on a fine-grain subscription level.

[0268] 3.6.2 Notification Services

[0269] The following section describes in detail the different components of the notification system and how the components implement the various notification requirements.

[0270] 3.6.2.1 Components

[0271] The system comprises the following components:

[0272] Client drivers, one per notification mechanism.

[0273] Unreliable ping protocol, either unicast or multicast

[0274] Unreliable notification protocol, intended for multicast

[0275] Synchronous request algorithm

[0276] Return-receipt support

[0277] Backup polling algorithm

[0278] Notification proxy server

[0279] Subscriber list management

[0280] Subscription meta-data describing parameters in section 3.2 Meta Data.

[0281] 3.6.2.2 Client Drivers

[0282] According to one embodiment, notification services are implemented as loadable “drivers”, each implementing a common service interface. A partial list of operations follows:

[0283] start

[0284] stop

[0285] subscribe

[0286] unsubscribe

[0287] show-configuration

[0288] configure

[0289] According to one embodiment, in order to implement a new notification mechanism, the standard driver interface is implemented. A common notification system manager handles all generic tasks, such as subscription validation, driver management, and delivery of information to the caching server.
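A common driver interface of this kind might be sketched as follows. The method names mirror the partial list of operations above; the class names, method signatures, and the toy timed-poll driver with its `poll_interval_seconds` setting are assumptions for illustration only.

```python
from abc import ABC, abstractmethod


class NotificationDriver(ABC):
    """Hypothetical common service interface for loadable drivers."""

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def stop(self): ...

    @abstractmethod
    def subscribe(self, subscription_id): ...

    @abstractmethod
    def unsubscribe(self, subscription_id): ...

    @abstractmethod
    def show_configuration(self): ...

    @abstractmethod
    def configure(self, settings): ...


class TimedPollDriver(NotificationDriver):
    """Toy driver used only to show the interface being implemented."""

    def __init__(self):
        self.running = False
        self.subscriptions = set()
        self.settings = {"poll_interval_seconds": 600}  # assumed setting

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def subscribe(self, subscription_id):
        self.subscriptions.add(subscription_id)

    def unsubscribe(self, subscription_id):
        self.subscriptions.discard(subscription_id)

    def show_configuration(self):
        return dict(self.settings)

    def configure(self, settings):
        self.settings.update(settings)
```

A notification system manager could then hold a collection of such drivers and invoke them uniformly, without knowledge of the underlying mechanism.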

[0290] 3.6.2.3 Reliability and Notification

[0291] The system does not attempt to notify using a reliable transport protocol. As far as the user is concerned, the high level notification process provides reliable delivery, but the system implements reliable delivery by using a combination of lightweight unreliable asynchronous notification, synchronous requests, and backup polling.

[0292] There are a number of problems with reliable transport. In the multicast world, reliability is difficult to implement well. Protocols like RMTP, for example, deal with various aspects of reliability, but at the expense of complexity. In the unicast world, reliable transport via TCP is easy to implement, but does not provide any bandwidth or server performance savings over unreliable notification followed by synchronous requests. In fact, the latter mechanism can provide higher performance than reliable unicast if caching servers are used (see following).

[0293] According to one embodiment of the present invention, asynchronous notification is implemented by providing an unreliable multicast atop IP multicast, and a very simple unicast “ping” protocol. Unreliable multicast over LANs will end up being reliable most of the time without requiring all the additional protocol complexity. Unicast will simply provide ping functionality, since transmitting the data itself to all recipients takes longer than asking the recipients to ask for the data.

[0294] 3.6.2.4 Unreliable Ping Protocol

[0295] In the unicast world, asynchronous notification by transmitting the entire notification is not practical. The publisher becomes responsible for sending a copy of the data to every subscriber, which is no different from the subscribers asking for it, provided the subscribers never ask unless there is new data available.

[0296] The “ping” protocol implements a means for the publisher to notify subscribers that new data is available for them to retrieve. This protocol immediately improves performance over simple polling because subscribers only ask for data when new data is available. The process is analogous to the post office leaving a user a note that there is a package waiting to be picked up. The user does not have to drive to the post office every day, but rather only when a note tells the user that a package is waiting.

[0297] Each subscriber thus needs to request data from the server. In the case where the information is shared and public, whenever a subscriber receives a ping, they wait a random amount of time before requesting the information. The first subscriber on a network segment requests the data of a caching proxy server or notification proxy. That entity then requests the data of the publisher. The random wait prevents all subscribers from asking at once, and increases substantially the likelihood that they can get the data from a cache instead of directly from the publisher, thus reducing server load. Even if there are no intermediaries between the subscriber and the publisher, the random wait distributes the load at the publisher.

[0298] If the data is not shared, then each subscriber does have to request the information from the publisher. But the overhead of transporting the information from publisher to subscriber would still have to happen once per subscriber. Multicasting personalized information does not render a benefit. Having the subscriber request the information is better than the reverse because the mechanisms already exist, they pass through firewalls, and they do not require additional store and forward infrastructure at the publisher.

[0299] The ping protocol is inherently unreliable, thus requiring a mechanism to deal with lost pings. Sequence numbers incremented by one may be used for each notification sent by the publisher. A subscriber that sees a hole in the sequence space simply asks for the missing notification(s). This mechanism is only necessary if the notifications comprise a stream of data, all elements of which must be received by the subscriber. If new notifications subsume older ones, then the sequence scheme does not need to be used.
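Detection of a hole in the sequence space might be sketched as follows. The function name is hypothetical, and the sketch assumes the subscriber tracks only the last sequence number it has seen for a subscription.

```python
def missing_notifications(last_seen, received):
    """Return the sequence numbers skipped between two notifications.

    last_seen is the last sequence number already processed; received is the
    sequence number of the notification that has just arrived. Duplicates and
    out-of-order arrivals yield an empty list.
    """
    return list(range(last_seen + 1, received))
```

A subscriber that receives notification 8 after notification 4 would thus request notifications 5, 6, and 7 from the publisher.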

[0300] Whether or not sequencing is used, the system also has to handle situations where no notifications arrive for “too long”. “Too long” is a time period defined by the publisher in the subscription meta-data sent to the subscriber at subscription time. When that time period elapses with no notifications, the subscriber polls the publisher for any changes and resets its timer. Whenever a notification arrives, the time period is reset, so that a poll only occurs N minutes after no word has arrived from the publisher. As long as the publisher's notifications arrive at regular intervals driven by the content, polling will almost never occur. Polling will occur only in the unlikely event that a packet was dropped between the publisher and the subscriber, or in the case where the subscriber's machine was disconnected from the network for a sufficiently long period of time.

[0301] The ping protocol works as follows: A single UDP packet is sent to each subscriber. In multicast configurations, the packet is sent to the subscription's multicast group. The packet contains the following information:

[0302] Publisher host name

[0303] Subscription identifier

[0304] Sequence number

[0305] URL to request, if it changes constantly and cannot therefore be part of the subscription meta-data. URL is parameterised by subscriber identifier.
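A ping packet carrying the fields listed above might be built and sent as sketched below. The wire format is not specified in this description, so the JSON encoding, the field names, and the function names are all assumptions made for illustration.

```python
import json
import socket


def encode_ping(publisher_host, subscription_id, sequence, url=None):
    """Build a ping payload; JSON is an assumed encoding, not the real one."""
    fields = {
        "publisher": publisher_host,
        "subscription": subscription_id,
        "sequence": sequence,
    }
    if url is not None:
        # Only present when the URL cannot be part of the subscription meta-data.
        fields["url"] = url
    return json.dumps(fields, separators=(",", ":")).encode("utf-8")


def decode_ping(payload):
    """Parse a payload produced by encode_ping."""
    return json.loads(payload.decode("utf-8"))


def send_ping(payload, address):
    """Send one UDP datagram to (host, port); delivery is not guaranteed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)
```

In a unicast configuration `send_ping` would be called once per subscriber address; in a multicast configuration the address would be that of the subscription's multicast group.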

[0306] 3.6.2.5 Unreliable Multicast Protocol

[0307] Delivery of actual notification data in multicast-enabled environments incurs bandwidth savings and performance gain on servers that do not need to waste time sending multiple copies of the data. LANs are generally reliable and the likelihood that a multicast notification will be received in its entirety is high. Backup request and polling mechanisms may thus rarely be required.

[0308] According to one embodiment, the protocol must provide error detection, so that subscribers know if they missed a packet and can request the data of the server directly. A simple packet sequencing scheme works just fine, and a higher-level notification sequencing scheme tells subscribers when they have missed a complete series of packets, or the last packet of the previous notification.

[0309] The notification data is broken up into UDP packets with the following header information:

[0310] Publisher host name

[0311] Subscription identifier

[0312] Data checksum

[0313] Data length

[0314] Return-receipt URL, parameterised for subscriber identifier

[0315] Notification sequence number within subscription

[0316] Packet sequence number within notification

[0317] last-packet indicator

[0318] The notification data is broken up into packets by the protocol, and each packet is then multicast to all interested parties. The last packet in the message is tagged with an indicator so that recipients know when the message has been received. The protocol ensures that as few packets as possible are dropped. The protocol can easily combine packets into groups and wait a fixed small amount of time before transmitting the next group. The publisher may use S-MIME to sign and optionally encrypt the notification content to ensure authentication security.
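Packetization and receiver-side error detection might be sketched as follows. The header is modelled as a dictionary rather than a binary layout, the checksum and length are assumed to cover each packet's own data, CRC-32 is an assumed checksum, the return-receipt URL field is omitted for brevity, and all names are hypothetical.

```python
import zlib


def packetize(data, publisher, subscription_id, notification_seq, chunk_size=512):
    """Split notification data into packets carrying the header fields above."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    packets = []
    for seq, chunk in enumerate(chunks):
        packets.append({
            "publisher": publisher,
            "subscription": subscription_id,
            "checksum": zlib.crc32(chunk),  # assumed per-packet checksum
            "length": len(chunk),
            "notification_seq": notification_seq,
            "packet_seq": seq,
            "last": seq == len(chunks) - 1,  # last-packet indicator
            "data": chunk,
        })
    return packets


def reassemble(packets):
    """Return the notification data, or None if it is broken or incomplete."""
    ordered = sorted(packets, key=lambda p: p["packet_seq"])
    if not ordered or not ordered[-1]["last"]:
        return None  # the final packet never arrived
    for expected, packet in enumerate(ordered):
        if packet["packet_seq"] != expected:
            return None  # hole in the packet sequence
        if len(packet["data"]) != packet["length"]:
            return None
        if zlib.crc32(packet["data"]) != packet["checksum"]:
            return None
    return b"".join(p["data"] for p in ordered)
```

A recipient for which `reassemble` returns None would fall back to a synchronous request for the data.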

[0319] 3.6.2.6 Synchronous Requests

[0320] Synchronous requests are an integral part of the overall system's reliable-delivery semantics, because the notification protocols are unreliable and may not even carry any data. Any time synchronous requests are used, the publisher is in danger of overloading. To minimize the risk, all synchronous requests are preceded by a random wait interval. Whenever a ping notification is received, each recipient waits a random amount of time before requesting the notification data. Similarly, whenever a broken or missing multicast notification is detected, the detecting recipient waits before requesting the data directly.

[0321] Random waiting has two direct benefits. First, if there are caching servers between the publisher and the subscriber, random waiting increases the likelihood that only one recipient will request the content of the publisher, with the other recipients getting a cached copy. Second, even if there are no caching servers in the loop, random waiting distributes the load at the publisher. Most publishers are set up to deal with high average request volumes, the notification process already eliminates spurious polling, multicast notifications will almost always be reliable, and the result is that load at the server should be manageable.
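The random wait might be sketched as follows. The uniform distribution, the upper bound parameter, and the function name are assumptions; the description above does not specify how the wait interval is chosen.

```python
import random


def request_delay(max_wait_seconds, rng=random):
    """Pick how long a recipient waits before issuing a synchronous request.

    rng may be the random module or a random.Random instance, which makes
    the choice reproducible in tests.
    """
    return rng.uniform(0.0, max_wait_seconds)
```

Because each recipient draws its delay independently, requests triggered by the same ping are spread across the interval instead of arriving together.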

[0322] 3.6.2.7 Return-Receipt

[0323] “Return-receipt” means that the publisher needs confirmation from each subscriber that a notification was received. There are obvious performance impacts, because the scheme requires both a subscriber list and direct communication between each subscriber and the publisher. The impact is most severe in an unreliable multicast environment. The system goes from one where a single copy is delivered to many unknown recipients with a high likelihood of reliability, to one where that process is followed up by N acknowledgments sent back to a publishing server which must now also maintain a list of all subscribers. The additional overhead is least felt in unicast environments, where each subscriber is already in contact with the publisher.

[0324] Return-receipt requires that the publisher maintain a list of its subscribers and mark each subscriber as having received the notification. A database is a logical choice for the list, since return-receipt subscriptions may well be highly personalized, or require payment, in which case a database with other subscriber information may already be in place. Database entries are created at subscription time and removed at unsubscription time.

[0325] There are also caching issues with some classes of return-receipt subscriptions. If the information is widely shared in a unicast environment, it still cannot take advantage of caching, since cached copies would by definition not be requested of the publisher, which would lose any return-receipt information. Instead, any URLs which identify return-receipt content must be parameterised by subscriber identifier so that the publisher can determine the subscriber who received (multicast) or is receiving (unicast with synchronous request) the content. The HTTP operation must also be marked by the requesting subscriber as “no-cache”, i.e. do not serve a cached copy. Finally, according to one embodiment, in order to keep the caching server from caching many copies of the data, HTTP 1.1 cache control operations can be used by the publisher to prevent content being cached. According to another embodiment, in the HTTP 1.0 environment, an expiration date in the past serves the same purpose.
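The cache-circumvention measures described above might be sketched as follows. The query parameter name `subscriber`, the function names, and the specific choice of `Cache-Control: no-store` for the HTTP 1.1 response are assumptions for illustration; `Pragma: no-cache` is the HTTP 1.0 request form of “no-cache”.

```python
from email.utils import formatdate
from urllib.parse import urlencode


def receipt_url(base_url, subscriber_id):
    """Parameterise a return-receipt URL by subscriber identifier."""
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode({"subscriber": subscriber_id})


def subscriber_request_headers():
    """Headers asking intermediaries not to serve a cached copy."""
    return {"Cache-Control": "no-cache", "Pragma": "no-cache"}


def publisher_response_headers(http_version="1.1"):
    """Headers the publisher might send to keep the content out of caches."""
    if http_version == "1.1":
        return {"Cache-Control": "no-store"}
    # HTTP 1.0: an expiration date in the past serves the same purpose.
    return {"Expires": formatdate(0, usegmt=True)}
```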

[0326] Return-receipt operations differ in multicast and unicast notification environments. In the unicast world, the return-receipt is implicit and occurs at the same time the content is requested. All that needs to be done is for the URL and the request to circumvent caching as described above. In the multicast world, the subscriber already has the data, and the return-receipt becomes a simple post with no data, where again the URL and request are formatted as described above. The subscriber must perform a random wait just as it would in the unicast world, to avoid inundating the publisher with requests.

[0327] 3.6.2.8 Backup Polling Algorithm

[0328] If the current polling mechanism proceeds on its own, it may request information when that information is already up to date, causing spurious requests of the server and lowering performance. Instead, according to one embodiment of the present invention the polling mechanism advances its timer by the polling interval any time one of the following events occurs:

[0329] The subscriber receives an unreliable ping and follows with a request of the publisher

[0330] The subscriber receives a broken multicast notification and follows with a request of the publisher

[0331] The subscriber receives a valid multicast notification

[0332] By advancing its timer, polling becomes a true backup mechanism, used only if no notifications arrive, or if a multicast transmission breaks. The publisher controls the polling interval via the subscription definition. The interval should be matched to the information's update frequency or to its timeliness, depending on how much the publisher trusts the notification mechanisms.
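The backup polling timer might be sketched as follows. The class and method names are hypothetical, times are plain numbers supplied by the caller, and “advancing the timer by the polling interval” is interpreted here as pushing the next poll out to one full interval after the event, which is an assumption.

```python
class BackupPoller:
    """Tracks when a backup poll is due for one subscription."""

    def __init__(self, interval, now):
        self.interval = interval          # polling interval from the subscription
        self.deadline = now + interval    # time at which the next poll is due

    def notification_handled(self, now):
        """Call after a ping, a broken multicast, or a valid multicast."""
        self.deadline = now + self.interval

    def poll_due(self, now):
        """Return True, and restart the timer, if a backup poll should run."""
        if now >= self.deadline:
            self.deadline = now + self.interval
            return True
        return False
```

As long as notifications keep arriving more often than the polling interval, `poll_due` never returns True and no spurious requests reach the publisher.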

[0333] 3.6.2.9 Notification Proxy Server

[0334] A notification proxy server implements the notification mechanism and uses it on behalf of a subscriber community. The proxy server stores incoming notifications, and subscribers poll the proxy periodically for any new notifications using HTTP. The proxy also stores authentication data so that only registered subscribers can use the proxy, and each subscriber has access only to its own notifications.

[0335] Candidates for proxy use are:

[0336] Laptops which are frequently disconnected from the network

[0337] Dialup users

[0338] DHCP users operating in a unicast environment

[0339] An intranet of subscribers whose network administrator does not want notifications crossing a firewall and also does not want the firewall and network overhead of all the clients doing timed poll through the firewall.

[0340] Laptops and dialup users are candidates because they are off the network often, and are therefore likely to miss notifications. That forces them to use timed poll more often than other subscribers, which may present an unacceptable load on the server or the network. Polling a local proxy may be more efficient, since a number of proxies can distribute the polling load.

[0341] DHCP users are candidates because publishers cannot effectively maintain subscriber lists when the addresses keep changing, and DHCP addresses change constantly. The publisher can try to use host names for address transparency, but desktop clients frequently do not have host names because they do not provide services. Note that in a multicast environment, DHCP hosts work fine, because they can join groups anonymously. The proxy is only useful in a unicast environment.

[0342] 3.6.2.10 Subscriber List Management

[0343] Subscriber list management is only an issue in the followingsituations:

[0344] return-receipt subscriptions

[0345] unicast notification

[0346] List management operations are performed during subscribe andunsubscribe operations. The subscribe and unsubscribe URLs referenceprograms or functionality built into the web server that manage adatabase. In the return-receipt case, that database probably alreadyexists, for payment or personalisation management. The list of contentscan be any unique identifier, since it will be given the subscriber atsubscription time, and the subscriber will place that identifier in itsrequest URL at content request or receipt confirmation time.

[0347] Lists maintained for unicast notification delivery contain IP addresses or host names, since a UDP ping protocol packet is sent to each member of the list. IP addresses are easier to deal with than host names. DHCP users use either a timed poll or a proxy with a stable address.

[0348] 3.6.2.11 Notification Filtering

[0349] The notification system guarantees that users will see no further notifications from a publisher once they remove that subscription. Whenever a notification arrives, its subscription identifier is checked against the list of subscriptions currently active. Any notification whose subscription is not on the list is ignored. In addition, the driver handling the notification can generate further unsubscribe requests and send them back to the publisher, in case the original removal request was lost.
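The final-backstop filter described above can be sketched as follows. This is a minimal illustration only: the class name, the method names, and the use of a simple queue for re-sent unsubscribe requests are assumptions and do not come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationFilter:
    """Drops notifications whose subscription is no longer active (hypothetical helper)."""
    active: set = field(default_factory=set)
    # Unsubscribe requests to re-send, in case the original one was lost.
    resend_queue: list = field(default_factory=list)

    def subscribe(self, sub_id):
        self.active.add(sub_id)

    def unsubscribe(self, sub_id):
        self.active.discard(sub_id)

    def accept(self, sub_id):
        """Return True if the notification should reach the user."""
        if sub_id in self.active:
            return True
        # Not an active subscription: ignore it and queue another unsubscribe.
        self.resend_queue.append(sub_id)
        return False
```

A notification for a removed subscription is therefore never delivered, regardless of whether any lower-level, driver-specific filter caught it first.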

[0350] There are lower-level driver-specific filtering mechanisms as well; the filtering described above is a final backstop that is guaranteed to keep unwanted notifications from reaching the user. For example, in a multicast service that does not use return-receipt functionality, the driver can unsubscribe by simply leaving the subscription's multicast group. The publisher is never given any sort of client network address, so it has no means of reaching the client once the client unsubscribes.

[0351] 3.6.3 Subscription Configuration

[0352] This section describes the process by which the notification system creates subscriptions. From the user's point of view, the act of subscribing to content is a very simple one-step operation. In particular, the user does not need to get involved in selecting notification mechanisms or configuring any subscription properties. All configuration is performed by negotiation between the client and the publisher, subject to any rules imposed by the client's local network environment.

[0353] When the client starts up, it configures itself according to optional meta-data that it fetches via a special configuration subscription built into the caching server (see section 3.3 Client Configuration). Part of that meta-data consists of a set of configuration HTML tags, one per notification driver. Each tag is identified by the driver's name and contains a set of driver-specific attributes used to configure the driver for use in the client's local network environment. Local network administrators can choose, for example, to disable certain drivers, guaranteeing that they will never be used to carry notifications. For example, the ICMCAST driver is disabled when the administrator puts the following configuration meta-data in their configuration page:

[0354] <ICMCAST DISABLE=YES>

[0355] Similarly, administrators can use the ICNOTIFYCONFIG meta-data tag to choose the negotiation order in which different services will be sent to the publisher, and can bind these negotiation orders to host name regular expressions: <ICNOTIFYCONFIG DRIVER_LIST="ICMCAST,ICDOORBELL" HOST_REGEXP=".*\.incommon.com">

[0356] The ICNOTIFYCONFIG tag allows administrators to force clients to use one order when communicating with a particular host or domain, and another order for another host or group of hosts.
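As an illustration of binding negotiation orders to host name regular expressions, the sketch below selects a driver order for a given publisher host. The parsed list form of the tags, the first-match-wins rule, the catch-all fallback entry, and the function name are all assumptions made for the example. The pattern here also escapes the second dot, which the tag shown in the text does not.

```python
import re

# Hypothetical parsed form of several ICNOTIFYCONFIG tags:
# (HOST_REGEXP, DRIVER_LIST) pairs, checked in order.
CONFIG = [
    (r".*\.incommon\.com", ["ICMCAST", "ICDOORBELL"]),
    (r".*", ["ICDOORBELL"]),  # assumed fallback for every other host
]


def driver_order_for(host, config=CONFIG):
    """Return the negotiation order bound to the first pattern matching host."""
    for pattern, drivers in config:
        if re.fullmatch(pattern, host):
            return drivers
    return []
```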

[0357] Once the client configures its notification services according to the wishes of the local network administrator, subscription becomes a matter of negotiating a driver with the publisher, and configuring the subscription according to its definition. When the client subscribes, it sends an HTTP request to the publisher. The request contains a list of desired notification drivers in preference order:

[0358] X-inCommon-Driver-List: ICMCast, ICDoorbell

[0359] The request also contains a configuration name/value pair for each driver, describing the driver such that the publisher will be able to send notifications to it. Each driver has its own configuration data, and not all drivers need this configuration information. The unreliable ping service, for example, needs to supply a TCP port number for the publisher's notifications:

[0360] X-inCommon-Driver-ICDoorbell: <ICDOOR LISTEN_PORT=2287>

[0361] The publisher picks the first driver on the driver list that matches a driver it is capable of using to send notifications. It then performs any registration that it needs to (entering subscriber information in a database, for example), and returns the subscription as an HTML page. The page contains all the information required for the client to receive notifications, including:

[0362] backup polling interval

[0363] custom scheduling

[0364] use of special services such as return-receipt

[0365] notification driver to use and its driver-specific configuration data

[0366] At this point the subscription process is complete, and notifications can begin flowing from the publisher to the client.

[0367] The notifications that arrive at the caching server can themselves modify their subscription's meta-data, changing its polling interval for example, or switching notification services from polling to multicast.
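The publisher's side of the driver negotiation can be sketched as below. The function names are hypothetical, and the case-insensitive comparison is an assumption, made because the text writes driver names as both ICMCAST and ICMCast.

```python
def parse_driver_list(header_value):
    """Split an X-inCommon-Driver-List header value into driver names."""
    return [name.strip() for name in header_value.split(",") if name.strip()]


def pick_driver(requested, supported):
    """Return the first requested driver the publisher supports, or None."""
    supported_lower = {name.lower() for name in supported}
    for name in requested:  # requested is already in client preference order
        if name.lower() in supported_lower:
            return name
    return None
```

For example, a publisher that supports only ICDoorbell would answer the request header shown above by selecting ICDoorbell, the client's second choice.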

[0368] 3.7 Generalized Reporting

[0369] The back-end server is also able to control caching server reporting via meta-data. Reporting meta-data is stored in a site meta-data page, just like other meta-data, such as ICLOOKAHEAD, ICEXPIRES, and TOC pointers. Reports are defined using the ICREPORT HTML tag. Each ICREPORT defines an internet domain, a set of filtering regular expressions, a report type, and an upload schedule.

[0370] The caching server periodically scans its cache as controlled by the upload schedule. Every piece of cache content whose URL matches the ICREPORT's domain is tested against the ICREPORT's filter regular expressions. The filtering expressions consist of a “match” filter and an optional “no-match” filter. Each URL must both match the “match” filter and not match the “no-match” filter if it is present.

[0371] Each piece of cache content that passes the filter then has information extracted from it that is appropriate to the ICREPORT's report type. Typical reports include hit counts, performance statistics (time required to fetch the content), or context in which the content was fetched (subscription, browser). Other reports can be created as needed, and identified by a new report type.

[0372] The reporting mechanism allows any content provider to get one or more reports on any subset of their content as stored in all client caches that access content owned by the publisher. The user need not subscribe to this content; they need only visit the site, whereupon the site meta-data is retrieved by the caching server and the report upload is configured. Because the filtering mechanism uses regular expressions, the publisher can create several ICREPORT meta-data tags, each defining reports for a different subset of their content.

[0373] The ICREPORT tag has the following attributes:

[0374] DOMAIN_NAME: the internet domain of which all matching content in the report must be a member. This domain name must also match the domain name of the site meta-data page's URL, thus preventing malicious content providers from getting report information on content that they do not own.

[0375] MATCH_REGEXP: the regular expression which cache content must match in order to be reported on.

[0376] NOMATCH_REGEXP: the regular expression which cache content must not match in order to be reported on. This attribute is optional.

[0377] REPORT_TO_INTERVAL: the interval in seconds between report uploads.

[0378] REPORT_TO_URL: the URL to which the report is delivered, via an HTTP POST.

[0379] REPORT_TYPE: the type of report desired. If this attribute is missing, the report type defaults to a hit count report.
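The domain and filter tests applied to each cached URL can be sketched as follows, using the DOMAIN_NAME, MATCH_REGEXP and NOMATCH_REGEXP attributes. The function name, the suffix-based domain membership test, and the use of an unanchored search rather than a full match are assumptions; the specification does not say how the expressions are anchored.

```python
import re
from urllib.parse import urlsplit


def passes_report_filter(url, domain_name, match_regexp, nomatch_regexp=None):
    """Return True if a cached URL should be included in the report."""
    host = urlsplit(url).hostname or ""
    # Assumed membership rule: the host is the domain itself or a sub-domain of it.
    if not (host == domain_name or host.endswith("." + domain_name)):
        return False
    if not re.search(match_regexp, url):
        return False
    # NOMATCH_REGEXP is optional; apply it only when present.
    if nomatch_regexp is not None and re.search(nomatch_regexp, url):
        return False
    return True
```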

[0380] 4. Technology Local to Caching Server

[0381] The following sections describe technology local to the caching server. Algorithms in this section provide intelligent cache management and use of network resources without the need for input from a back-end server. Accordingly, these algorithms cannot provide the fine tuning that is possible with interaction from a back-end server, but do provide some acceleration on any web site, even if that site does not have a back-end server.

[0382] 4.1 Automatic Expiration Control

[0383] 4.1.1 Overview

[0384] Whenever the server is asked to retrieve content from the web, the server places the content in local storage while returning the content to the requester (either the browser, a subscription, or the server itself). The server then satisfies subsequent requests for the same content from the local storage rather than going to the network. This strategy improves performance but is achieved at a cost. If the content changes at its origin, the caching server will deliver an old copy of the data from local storage, rather than the new copy from the net.

[0385] The caching server solves this problem by assigning each piece of content an expiration date. The server satisfies requests for cached content from local storage until the expiration date is reached, after which time it checks at the origin site to see if the content has changed. The origin site may tell the server what the expiration date is, based on knowledge of the content (see Section 3.3, Custom Expiration Control). There may be sites, however, that do not know their content's expiration behavior. In these cases the caching server is forced to invent an expiration date. The quality of the algorithm used to invent the expiration date is extremely important. If it is too liberal, the caching server will serve out-dated content from the cache. If it is too conservative, the server must access the Web frequently, thus reducing performance.

[0386] 4.1.2 Algorithm Pseudo-Code

[0387] Following is a pseudo-code description of the expiration date computation algorithm according to one embodiment, followed by detailed explanations of the algorithm components.

if (documentChanged && accessedMoreRecentlyThanModified) then
  AddNewLifetimeSample
endif
if (document has no modification data) then
  expiration = now + fixed amount
else if (document has not changed) then
  if (document has lifetime samples) then
    expiration = now + one sample variance
  else
    expiration = now + ((now − last modification date)/2)
  endif
else if (document has changed) then
  if (document has lifetime samples) then
    expiration = last modification date + mean lifetime − one sample variance
    if (expiration is in the past) then
      expiration = now + 1 sample variance
    endif
  else
    expiration = now + ((now − last modification date)/2)
  endif
endif

[0388] 4.1.3 Lifetime Samples

[0389] The algorithm used by the server attempts to “learn” the expiration behavior of each piece of content by tracking its modification history. Every time the content is accessed from the network, its last-modification date is recorded. If that last-modification date changes and the content is accessed subsequent to that change, then the object has not changed between the current last-modification date and the access time. That time interval can thus be treated as a sample of how often the object changes, i.e. its lifetime.

[0390] Each sample is plugged into an estimator algorithm that tracks two quantities: the mean lifetime estimate M and the variance in lifetime samples V. As each new sample is added, the mean and variance change. The variance is weighted more heavily toward recent behavior. The estimator algorithm used is identical to that used in the TCP/IP transport protocol for network round-trip estimation, and is also known as the “Jacobson-Karels estimator”. The known estimator algorithm is utilized in a novel manner as described below.
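For orientation, a minimal sketch of such an estimator follows. The specification does not state the gains or the initialization, so the values 1/8 and 1/4 and the first-sample rule are assumptions borrowed from the conventional TCP round-trip estimator. Note that in that estimator the quantity called the variance is a smoothed mean deviation. The class and attribute names are hypothetical.

```python
class LifetimeEstimator:
    """Tracks a smoothed mean lifetime M and a smoothed deviation V."""

    def __init__(self, gain_mean=0.125, gain_var=0.25):
        self.mean = None            # M; None until the first sample arrives
        self.var = 0.0              # V
        self.gain_mean = gain_mean  # assumed gain for M (1/8 in TCP)
        self.gain_var = gain_var    # assumed gain for V (1/4 in TCP)

    def add_sample(self, lifetime):
        """Fold one observed lifetime (in seconds) into M and V."""
        if self.mean is None:
            # Assumed initialization, as in the TCP estimator.
            self.mean = float(lifetime)
            self.var = lifetime / 2.0
            return
        err = lifetime - self.mean
        self.mean += self.gain_mean * err
        # The larger gain makes V respond faster to recent behavior than M.
        self.var += self.gain_var * (abs(err) - self.var)
```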

[0391] 4.1.4 Case 1: Lifetime Samples Exist

[0392] As long as the server owning the content supplies last-modification dates with its content, this scheme works fairly well. The caching server accumulates a history of samples for each piece of content and stores them permanently. Whenever an object expires, i.e. it is requested and the caching server sees that the content's expiration date is in the past, the caching server goes to the network and asks the content owner for a new copy of the object. The caching server then uses its samples in one of two ways to create a new expiration for the content. Which method it uses depends on whether the content has actually changed or not.

[0393] If the content has not changed, then the previous expiration date was too conservative, i.e. too short. Another sample cannot be accumulated because the object has not changed. Instead, the server finds itself in a “grey zone” where its data indicates the object should expire, but the owning server indicates that the object has not expired. In this situation, the server simply adds a single variance V from the content's estimator to the current time and uses that as the new expiration date. This value allows the server to provide some performance benefit (by continuing to cache the object). The variance makes a good “fudge factor” (so called because the server is operating with insufficient data and must guess at a reasonable new expiration date) because it measures the difference between the accumulated samples and their mean. The variance therefore provides at least some approximation to a valid sample, even if it is statistically less likely than the mean.

[0394] If the content has indeed changed, then the server adds to the content's most recent last-modified date the current value M from the estimator algorithm. In order to lessen the likelihood that the server will expire the content too late, it then subtracts from the result the current variance value V. The single variance V is chosen for the same reasons as in the previous paragraph. If V is sufficiently large that the last-modified date plus M minus V is before the current time, i.e. the object will already have expired, then technically the object is in a “grey zone” where it could expire at any moment. Given that the caching server wants to provide some performance benefit even under these circumstances, it creates an expiration date that is as small as possible while still allowing the object to be cached. Again, the server chooses a single variance V, for the same reasons as described above.

[0395] 4.1.5 Case 2: No Sample Data

[0396] If the caching server has no samples, it still attempts to provide a rational expiration date, but it must do so with less data to go on. Again, there are two algorithms used: one if the content actually changed, and one if the content did not change and the previous expiration date was too conservative.

[0397] If the content has not changed, the caching server constructs a new expiration date which is the current time plus half the difference between the current time and the time the content was last modified. Although this value is not as good as one derived from watching the object's modification history, it works reasonably well. The object was not modified in the interval between its last-modification date and the current time. It is thus predicted to not be modified for the same amount of time into the future, although with little certainty. That lack of certainty is reflected in taking half of the interval rather than a full interval. Again, the idea is for the algorithm to balance accuracy (always handing back the most recent content) with performance (always satisfying requests from the cache).

[0398] If the content has changed, the frequency with which the object changes (one interval between now and the time last modified) is likely to be an inaccurate estimate. The object is ideally served out of the cache to maintain performance, but must also be accurate. The algorithm therefore works in exactly the same manner in both situations: the new expiration date is the current time plus half the interval since the last modification.

[0399] 4.1.6 Case 3: No Data at All

[0400] If the originating server is unfriendly, and provides no modification data, there is even less data for the caching server to go on. It must thus make an expiration estimate that is essentially a wild guess. What it does in this situation is add a configurable amount of time to the current date. The time is based on site meta-data if provided, or a constant of the implementor's choosing.
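The three cases above can be collected into one function, sketched below under stated assumptions. Times are plain numbers of seconds; the function and parameter names are hypothetical; the one-hour default for the fixed amount is an arbitrary placeholder, not a value from the specification. Recording a new lifetime sample is assumed to have happened before the call.

```python
def next_expiration(now, last_modified, changed,
                    mean=None, variance=None, fixed_amount=3600.0):
    """Compute a new expiration time for a piece of content.

    last_modified is None when the origin supplies no modification data;
    mean and variance are None when no lifetime samples exist yet.
    """
    # Case 3: no data at all, so add a configurable amount to the current time.
    if last_modified is None:
        return now + fixed_amount

    # Case 2: no samples; use half the interval since the last modification,
    # whether or not the content changed.
    if mean is None or variance is None:
        return now + (now - last_modified) / 2

    # Case 1, content unchanged: extend by a single variance.
    if not changed:
        return now + variance

    # Case 1, content changed: last-modified + M - V, unless that is already past.
    expiration = last_modified + mean - variance
    if expiration < now:
        expiration = now + variance
    return expiration
```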

[0401] 4.2 Cache Compaction

[0402] 4.2.1 Overview

[0403] As the server's cache fills with content, eventually the cache becomes large enough that its use of resources begins to affect the client machine's performance. To forestall this situation, the caching server automatically compacts its cache periodically. The cache size ceiling starts at a standard value (measured in megabytes of content), which can be changed by the end user. Whenever the cache size exceeds its ceiling, compaction occurs.

[0404] The most important part of the compaction process is deciding which pieces of content to remove from the cache and which to keep. If the algorithm is not discriminating enough, then content which the user accesses frequently is just as likely to be removed as content which the user has rarely seen. Lookahead complicates the situation, since by definition it is predictive and if it fails to predict correctly, content will be cached that the user never accesses. Size of the content also complicates things, since large content is more expensive to fetch than small content.

[0405] The compaction algorithm described below takes into account a number of factors which together decide accurately which content to keep and which to throw out.

[0406] 4.2.2 Compaction Algorithm

[0407] Compaction is an expensive operation relative to other operations performed by the caching server. In order to keep compaction from occurring too often, the algorithm does not simply remove content until the cache size returns to its allowable maximum. Instead, the algorithm compacts the cache to 75% of its former size, allowing some head room for the cache to grow before compaction again occurs. 75% is a reasonable compromise between frequency of compaction (if the percentage were higher) and lack of desired content and corresponding lower performance (lower percentages).

[0408] According to one embodiment, the compaction algorithm measures the following properties of each piece of content:

[0409] when it was last accessed

[0410] how much network resource is required to retrieve the content

[0411] how often it is accessed

[0412] These three properties are normalized into a score using the algorithm described below. All content is then ordered by score, at which point compaction becomes a simple process of removing the worst-scoring content until the overall cache size drops below the ceiling. The algorithm places paramount importance on the time a piece of content was last accessed. If the content has never been accessed, then the importance is given to the time it arrived in the cache. Content that is old needs to be removed quickly. Second in importance is frequency of access. Even old content should remain in the cache if it is accessed often enough. Size is least important, unless the content is truly huge, in which case the algorithm can keep significantly more pieces of content in the cache by deleting a single large object, and should probably do so. The scoring algorithm normalizes these three components (size, last-access, and frequency of access) into a single value, where high scores indicate a more suitable candidate for removal, and low scores a more suitable candidate for retention.

[0413] 4.2.2.1 Last-Access Component

[0414] The last-access part of the score is highly non-linear, approximated with piece-wise linear approximation. Content is assumed to be most useful in the first eight hours after it is accessed. It then ratchets down in usefulness if between eight hours and four days has elapsed since the content has been last accessed. According to one embodiment, once a piece of content has not been accessed in more than four days, it is deemed useless unless accessed often prior to that most recent access. If a piece of content is never accessed, the algorithm assigns an initial last access time equal to the time the content arrived in the cache. That is, all pieces of content are “accessed” at least once, with that time equal to the arrival time.

[0415] The algorithm assigns one point to the last-access component for each minute since the content has been accessed, up to a maximum of eight hours worth, or 480 points:

480=8*60

[0416] For each minute over eight hours but below four days since the content has been accessed, the algorithm assigns half a point to the last-access component, up to a cumulative maximum of 3120 points:

3120=480+(((96 hours−8 hours)*60 minutes)/2)

[0417] Finally, for each minute over four days since the content has been accessed, the algorithm assigns four points to the last-access component. The dramatic increase in the slope of the function represents our belief that content not accessed in the past four days is very unlikely to be used again unless it has been accessed often before that. The increased slope is designed to be large enough that content accessed once or twice will get a high score and be removed, but small enough so that content accessed 10-20 times will reduce the score to the point that the object might not be removed. There is no maximum on the number of points assigned in this manner. For example, at approximately a month after most recent access, the score is:

480+(((96 hours−8 hours)*60 minutes)/2)+(28 days−4 days)*24 hours*60 minutes*4=141360
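The piece-wise linear last-access component can be written directly from the three slopes given above. The function and constant names are hypothetical; the argument is the number of minutes since the last access (or since arrival in the cache, for content that was never accessed).

```python
EIGHT_HOURS = 8 * 60      # minutes
FOUR_DAYS = 4 * 24 * 60   # minutes


def last_access_score(minutes):
    """Last-access term L: 1 point/min up to 8 h, 0.5 up to 4 days, 4 beyond."""
    score = float(min(minutes, EIGHT_HOURS))
    if minutes > EIGHT_HOURS:
        score += (min(minutes, FOUR_DAYS) - EIGHT_HOURS) * 0.5
    if minutes > FOUR_DAYS:
        score += (minutes - FOUR_DAYS) * 4.0
    return score
```

Evaluating it at eight hours, four days, and 28 days gives 480, 3120, and 141360 points respectively, matching the figures in the text.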

[0418] 4.2.2.2 Size Component

[0419] The algorithm is non-linearly biased against “large” content. The algorithm divides all content up into three zones based on size. The zone boundaries are based on the distribution of content sizes in the internet. Zone 1 content is typically HTML, and must be less than 16 kilobytes in size. Zone 2 content is typically inline images and ranges in size from 16 to 40 kilobytes. Zone 3 content is typically large images, audio files, and full-motion video, and is larger than 40 kilobytes.

[0420] To approximate the desired non-linear behavior, the algorithm starts with a “score” equal to the content's size. If the content is in zone 2 or 3, the score is increased by the amount that the content exceeds 16 KB. If the content is in zone 3, the score is further increased by the amount that the content exceeds 40 KB. The algorithm considers high scores as more likely for compaction than low scores. As an example, a piece of content 12 KB in size gets a score of 12. A piece of content that is 31 KB in size gets a score of 31+15=46. Finally, a piece of content that is 122 KB in size gets a score of 122+106+82=310. The reason the algorithm gives a disproportionately large score to large content is that it can keep far more content in the cache by deleting a single large piece of content than a number of smaller pieces of content. By increasing the score related to size non-linearly, the algorithm is far more likely to remove big content than smaller content.

[0421] The formula for size-related scoring is thus:

S=size+max(0,size−16K)+max(0,size−40K)

[0422] 4.2.2.3 Frequency Component

[0423] The frequency component is also non-linear. The reason for the non-linearity is that there is a large class of content which is accessed either never or only once. Content that is never accessed could, for example, have been looked ahead on by the caching server but never actually seen by the user. Other content never accessed could include headlines loaded by the caching server on behalf of a channel, but never read by the user.

[0424] Content accessed zero times or one time may be maintained in the cache for an eight hour period and then disposed of as quickly as possible. Content accessed more than once, however, increases dramatically in importance, because it probably is not a headline that was read once and discarded, but rather a more useful piece of content.

[0425] The algorithm uses a step function to reflect this property of incoming content. Content accessed zero times is assigned a frequency component term of 1. Content accessed once is given a value of 2. Content accessed more than once is given a value of the number of accesses plus 3. The function is therefore linear with a “jump” at two accesses.
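The frequency term described above is small enough to state in full. The function name is hypothetical.

```python
def frequency_term(accesses):
    """Frequency term F: 1 for zero accesses, 2 for one, accesses + 3 otherwise."""
    if accesses < 0:
        raise ValueError("access count cannot be negative")
    if accesses == 0:
        return 1
    if accesses == 1:
        return 2
    # The "jump": two accesses already yield 5, then F grows by 1 per access.
    return accesses + 3
```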

[0426] 4.2.2.4 Complete Algorithm Formula

[0427] According to one embodiment, the complete scoring algorithm formula for the compaction score C is:

C=(S+L)/F

[0428] Where S is the size term, L is the last-access term, and F is the frequency term.

[0429] High scores describe content that should be removed, and low scores describe content that should be kept. A high score therefore arises from a combination of a low number of accesses, a last-access time far in the past, and a large size.

[0430] The algorithm has the following desirable properties:

[0431] the value of L is large compared to that of S as time increases; that means the more time that has elapsed since the content has been accessed, the less important the content's size is. S is most able to affect the score in the first eight hours since content has been accessed, where the value of L is fairly small.

[0432] The larger F, the smaller the score. Most content deserving of removal is accessed once (F=2) or not at all (F=1), making the frequency component essentially irrelevant and lending all importance to S and L.

[0433] As content is accessed more than once, the value of F jumps from 2 to 5 and then increases linearly after each access. That means even if a 40K piece of content has gone a week without being accessed, if it was accessed 50 times before that, its score:

(64+20400)/53=386

[0434] is approximately the same as that of the same piece of content accessed only once in the past eight hours.

(64+480)/2=272

[0435] This property is particularly useful for boilerplate graphics that are not accessed for a while. Such graphics can be expensive to download, and the algorithm should try to keep them in the cache if they are shared by many web pages.
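The complete score and the removal pass can be sketched as follows. The size, last-access, and frequency terms are taken as inputs here; the entry layout (name, size in KB, score) and the function names are assumptions for illustration, and the target size would be whatever fraction of the cache the embodiment compacts to.

```python
def compaction_score(size_term, last_access_term, frequency_term):
    """C = (S + L) / F; higher scores are better candidates for removal."""
    return (size_term + last_access_term) / frequency_term


def select_for_removal(entries, target_kb):
    """Pick entries to delete, worst score first, until the cache fits target_kb.

    entries is a list of (name, size_kb, score) tuples.
    """
    total = sum(size for _, size, _ in entries)
    removed = []
    for name, size, _ in sorted(entries, key=lambda e: e[2], reverse=True):
        if total <= target_kb:
            break
        removed.append(name)
        total -= size
    return removed
```

With the terms from the worked example, `compaction_score(64, 20400, 53)` is about 386 and `compaction_score(64, 480, 2)` is 272, the two figures quoted in the text.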

[0436] 4.3 Bandwidth Management

[0437] 4.3.1 Overview

[0438] The caching server is responsible for making sure the user's browsing experience is as good as possible. While lookahead improves the user experience in the long run, in the short run lookahead can detract from the user's experience if background lookahead processing is taking up network bandwidth while the user is performing foreground browsing operations. The caching server has several mechanisms to manage bandwidth and optimize use of the network. FIG. 8 is a flow chart illustrating an overview of one embodiment of bandwidth management.

[0439] 4.3.2 Bandwidth Sharing

[0440] According to one embodiment of the present invention, every task performed by the caching server runs at a particular priority. Request tasks are responsible for taking a request made internally by the caching server or by the browser and satisfying it, either from the cache or from the network. If the request is satisfied through the network, it must share bandwidth with other requests in a manner consistent with its priority. Browser requests, for example, must get more of the available bandwidth than lookahead requests.

[0441] According to one embodiment, a standard sockets interface is used in a special algorithm that implicitly manages bandwidth according to request priority. Each TCP connection managed by Windows Sockets has a “window” which represents the amount of data on that connection that can be in transit before transmission stops and awaits acknowledgment from the destination endpoint. The larger the window, the more data can be in transit. Incoming TCP data is buffered for delivery to the application, which reads the data, causing it to be acknowledged to the sender, which in turn opens the transmission window for more data.

[0442] By not reading a TCP connection, an application can prevent data from being acknowledged and therefore more data from being transmitted. By selectively reading and not reading different TCP connections, the caching server can control the amount of network bandwidth taken by each connection. The only problem with this solution is the lag between the time the caching server stops reading a connection and the time the sender's window is exhausted. A large window can keep sufficiently large amounts of data “in the pipe” that ceasing to read a connection does not immediately lower that connection's share of the bandwidth.

[0443] To solve this problem, each TCP connection's transmission window is adjusted according to request priority. High-priority requests get the largest window recommended (typically 8192 bytes), and low-priority requests get a small window, typically one or two network Maximum Transmission Units (MTU). When a low-priority request is running, it will run more slowly, and with less data outstanding, than a high-priority request. As soon as a high-priority request enters the system, data from all lower-priority requests are ignored. Since low-priority requests have small TCP windows, the amount of data still in the pipe from those connections is quite small (and completely controlled by setting the low-priority request window size), and ceasing to read the connections causes an immediate drop in the bandwidth consumed by them, bandwidth which can then be taken by the high-priority connections.
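One way to approximate per-priority windows through a standard sockets interface is to size each connection's receive buffer, which bounds the window the receiver can advertise. The sketch below is an assumption about how this could be done, not the embodiment's code: the 1500-byte MTU, the choice of two MTUs for low priority, and the function names are illustrative, and operating systems may round or clamp the requested buffer size.

```python
import socket

HIGH_PRIORITY_WINDOW = 8192  # bytes; the "largest window recommended" in the text
ASSUMED_MTU = 1500           # bytes; a common Ethernet value, assumed here


def window_for_priority(high_priority, mtu=ASSUMED_MTU):
    """Return the receive-buffer size to request for a connection."""
    return HIGH_PRIORITY_WINDOW if high_priority else 2 * mtu


def open_connection_socket(high_priority):
    """Create an unconnected TCP socket with a priority-dependent receive buffer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Requested before connecting so that it can influence the advertised window.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF,
                    window_for_priority(high_priority))
    return sock
```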

[0444] As soon as there are no more high-priority requests, the caching server once again begins reading from lower-priority request connections, opening up their windows. The server does not begin doing so immediately. Instead, it assumes that one high-priority request will generate other high-priority requests (as an HTML page request by a browser will lead to requests for that page's in-line images, for example). Every time high-priority traffic travels through the server, the server advances a timer. The duration of the timer is a measure of how long the server is willing to wait before it decides there are no immediately following high priority requests. Only when the timer expires does the server begin reading from lower-priority connections.
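The hold-off timer can be sketched as a small gate that the read loop consults before servicing low-priority connections. The class name, method names, and the two-second default are hypothetical; the current time is passed in explicitly so the behavior is easy to follow.

```python
class LowPriorityGate:
    """Decides when low-priority connections may be read again."""

    def __init__(self, holdoff_seconds=2.0):
        self.holdoff = holdoff_seconds  # assumed wait after high-priority traffic
        self.active_high = 0            # high-priority requests in progress
        self.resume_at = 0.0            # earliest time low-priority reads may resume

    def high_priority_started(self):
        self.active_high += 1

    def high_priority_traffic(self, now):
        # Each burst of high-priority traffic pushes the timer forward.
        self.resume_at = now + self.holdoff

    def high_priority_finished(self, now):
        self.active_high -= 1
        self.resume_at = now + self.holdoff

    def may_read_low_priority(self, now):
        return self.active_high == 0 and now >= self.resume_at
```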

[0445] There are a number of variations on this algorithm that the server can employ as needed. As it measures the number and frequency of high-priority connections, it can further improve the performance of low-priority connections by opening their windows. When the server believes enough time has passed that the presence of a high-priority request is increasingly likely, it can begin shrinking the size of low-priority request windows, so that when a high-priority request finally does arrive, the low priority requests have small windows and therefore small amounts of data in the pipe and minimal impact on available bandwidth.

[0446] Thus, a method and apparatus for storage and delivery of documents on the Internet is disclosed. The specific arrangements and methods described herein are merely illustrative of the principles of the present invention. Numerous modifications in form and detail may be made by those of ordinary skill in the art without departing from the scope of the invention. Although this invention has been shown in relation to a particular preferred embodiment, it should not be considered so limited. Rather, the present invention is limited only by the scope of the appended claims.

We claim:
 1. A method for validating a collection of data, the methodincluding: receiving a request for data in the collection of data, eachdata in the collection of data associated to an expiration information,the collection of data identified by associating similar expirationinformation, the collection of data associated with a table of contents(TOC); examining the TOC to determine whether the TOC is expired;updating the TOC if the TOC is expired; and validating the collection ofdata with the TOC.
 2. The method of claim 1, wherein the expirationinformation includes a time the data was last modified.
 3. The method ofclaim 1, wherein the TOC further includes an expiration time.
 4. Themethod of claim 3, wherein examining the TOC includes determiningwhether the expiration time has expired.
 5. The method of claim 1,wherein the data is a network object.
 6. The method of claim 5, whereinthe network object is referenced with a URL.
 7. The method of claim 5,wherein the network object is provided by a content provider.
 8. Anmethod for validating a collection of data, the method including:defining a table of Contents (TOC) for the collection of data, each datain the collection of data associated to an expiration information, thecollection of data identified by associating similar expirationinformation; receiving a request for a TOC; and sending the TOC tovalidate the collection of data.
 9. The method of claim 8, wherein defining includes scanning file system directories recursively.
 10. The method of claim 8, wherein creating includes scanning HTML content for references to a plurality of HTML content and following the plurality of HTML content recursively.
 11. The method of claim 8, wherein creating includes invoking a content management system via an Application Programming Interface (API).
 12. An apparatus for validating a collection of data, the apparatus including: a notification mechanism to request a data associated with the collection of data, each data in the collection of data associated to an expiration information, the collection of data identified by associating similar expiration information, the collection of data associated with a table of contents (TOC); and a caching server to receive the request, examine the TOC to determine whether the TOC is expired, update the TOC if the TOC is expired and validate the collection of data with the TOC.
 13. The apparatus of claim 12, wherein the notification mechanism includes polling.
 14. A system to validate a collection of data, the system including: a first means to request a data in the collection of data, each data in the collection of data associated to an expiration information, the collection of data identified by associating the expiration information, the collection of data associated with a table of contents (TOC); and a second means to receive the request, examine the TOC to determine whether the TOC is expired, update the TOC if the TOC is expired and validate the collection of data with the TOC.
 15. A computer software product including a medium readable by a processor, the medium having stored thereon a sequence of instructions which, when executed by the processor, cause the processor to: receive a request for a data in the collection of data, each data in the collection of data associated to an expiration information, the collection of data identified by associating similar expiration information, the collection of data associated with a table of contents (TOC); examine the TOC to determine whether the TOC is expired; update the TOC if the TOC is expired; and validate the collection of data with the TOC.
 16. A computer software product including a medium readable by a processor, the medium having stored thereon a sequence of instructions which, when executed by the processor, cause the processor to: define a table of contents (TOC) for the collection of data, each data in the collection associated to an expiration information, the collection of data identified by associating similar expiration information; receive a request for a TOC; and send the TOC to validate the collection of data.
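
For illustration only, and without limiting the claims, the steps recited in claim 1 can be sketched as follows. The names `TOC`, `CachedCollection`, `validate_collection`, and `fetch_toc` are hypothetical, as is the choice to represent the expiration information as a last-modified time per URL (one option, per claims 2 and 6) and to report stale entries as a set.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class TOC:
    """Hypothetical table of contents for a collection of data."""
    expires_at: float                # expiration time of the TOC itself
    last_modified: Dict[str, float]  # URL -> last-modified time


@dataclass
class CachedCollection:
    """Hypothetical cache-side view of the collection."""
    toc: TOC
    cached_modified: Dict[str, float] = field(default_factory=dict)


def validate_collection(
    collection: CachedCollection,
    now: float,
    fetch_toc: Callable[[], TOC],
) -> Set[str]:
    """Return the URLs whose cached copies no longer match the TOC."""
    # Examine the TOC to determine whether it is expired.
    if now >= collection.toc.expires_at:
        # Update the TOC if it is expired (one request to the provider).
        collection.toc = fetch_toc()
    # Validate the whole collection against the TOC in one pass,
    # rather than validating each object with its own request.
    return {
        url
        for url, modified in collection.toc.last_modified.items()
        if collection.cached_modified.get(url) != modified
    }


old_toc = TOC(expires_at=100.0, last_modified={"a": 1.0, "b": 2.0})
cache = CachedCollection(toc=old_toc, cached_modified={"a": 1.0, "b": 2.0})
new_toc = TOC(expires_at=200.0, last_modified={"a": 1.0, "b": 3.0})
stale = validate_collection(cache, now=150.0, fetch_toc=lambda: new_toc)
```

In this sketch a single TOC refresh is enough to decide which cached objects are stale, so only those objects need to be retrieved again.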